# McGinty and Sunyaev, Supplementary Code
## Table of contents <a name="TOC"></a>
1. [Process sample dataset or full dataset?](#sample)
2. [Load libraries and define functions which will be used throughout the analysis](#libraries)
    1. [Load python libraries](#libraries1)
    2. [Define functions](#functions)
3. [Process hg38 genome fasta files](#hg38)
    1. [Download files](#hg38download)
    2. [Read files](#hg38read)
    3. [Count trinucleotides](#hg38count)
4. [Use Repeatmasker to remove transposable elements from hg38](#RM)
    1. [Download files](#RMdownload)  
    2. [Read and format files](#RMread)
    3. [Count trinucleotides](#RMcount)  
    4. [Modify genome fastas to remove Repeatmasker regions](#RMmodify)  
5. [Generate a database of repeat motifs in hg38](#DB)
    1. [Short Tandem Repeats](#DB_STR)  
        1. [Define search features](#DB_STR_define)
            1. [All possible N-mers](#DB_STR_define_allNmers)  
            2. [Reduce to nonredundant N-mers](#DB_STR_define_nonredundant)  
            3. [Find high-N-mers that repeat at least once](#DB_STR_define_reduce)  
        1. [Search, including overlaps](#DB_STR_search)
        2. [Expand by full repeat unit](#DB_STR_expand_full)
        3. [Expand by partial repeat units to get longest perfect STRs](#DB_STR_expand_partial)
        4. [Find imperfect STRs](#DB_STR_imperfect)
            1. [Overlap coordinates](#DB_STR_imperfect_overlap)
            2. [Find location of interruptions within each STR](#DB_STR_imperfect_location)
            3. [Separate perfect vs imperfect STRs](#DB_STR_imperfect_separate)
            4. [Reversion predictions for in-frame interruptions](#DB_STR_imperfect_predict)
            5. [Save/load STR database](#DB_STR_imperfect_save)
        5. [Mask reference genome with STRs](#DB_STR_mask)
    2. [Inverted Repeats](#DB_IR)  
        1. [Find all 5-mer IRs](#DB_IR_5mer)
        2. [Expand perfect IRs](#DB_IR_expand)
        3. [Expand imperfect IRs](#DB_IR_expand_imperfect)
    3. [Mirror Repeats](#DB_MR)  
        1. [Find all 5-mer MRs](#DB_MR_5mer)
        2. [Expand perfect MRs](#DB_MR_expand)
        3. [Expand imperfect MRs](#DB_MR_expand_imperfect)
    4. [Direct Repeats](#DB_DR)  
        1. [Find perfect DRs](#DB_DR_perfect)
        2. [Expand imperfect DRs](#DB_DR_imperfect)   
    5. [Z-DNA Motifs](#DB_ZDNA)  
        1. [Version 1: RY without AT](#DB_ZDNA_v1)
        2. [Version 2: GY](#DB_ZDNA_v2)
    6. [G4 Motifs](#DB_G4)  
        1. [Liftover from hg19 to hg38](#DB_G4_liftover)
        2. [Combine K+ and PDS G4-seq conditions](#DB_G4_combine)
        3. [Check for overlap between PDS and K+ G4-seq conditions](#DB_G4_overlaps)
        4. [Confirm strand orientation and locate motifs](#DB_G4_motifs)
        5. [Save/load G4 database](#DB_G4_save)
    7. [Combine all motifs into single database](#DB_all)
        1. [Calculate distance to motifs and transposons, then filter database to unique motifs](#DB_all_distance)
        2. [Motif type overlaps](#DB_all_overlaps)
            1. [Plot for Supplementary Figures S1A and S1B](#DB_nonbdb_plot_S1A)
    8. [Random sequences](#DB_random)
    9. [Non-B DB](#DB_nonbdb)
        1. [Modify and annotate Non-B DB](#DB_nonbdb_modify)
            1. [Load and format](#DB_nonbdb_load)
            2. [Add in G4 motifs from G4-seq](#DB_nonbdb_G4)
            3. [Add in random sequences](#DB_nonbdb_random)
            4. [Filter database by proximity (or don't)](#DB_nonbdb_filter)
        2. [Plot overlaps between Non-B DB categories](#DB_nonbdb_plot)
            1. [Plot for Supplementary Figure S1c](#DB_nonbdb_plot_S1C)
            2. [Plot for Supplementary Figure S1d](#DB_nonbdb_plot_S1D)        
6. [Prepare gnomAD database](#GNOMAD)
    1. [Read gnomAD file in chunks](#GNOMAD_shrink_read)
    2. [Combine chunks into .csv file](#GNOMAD_shrink_combine)
    3. [Filter to rare SNVs and count GC content](#GNOMAD_rare)
    4. [Load gnomAD SNV database and calculate trinucleotide mutation frequency](#GNOMAD_freq)
        1. [Generate downsampled gnomAD files based on AC](#GNOMAD_downsample_generate)
        2. [Correct allele count to reflect number of independent mutations](#GNOMAD_AC_correction)
        3. [Total count and frequency using allele counts](#GNOMAD_freq_AC)
        4. [Transition/transversion ratio](#GNOMAD_freq_tstv)
    5. [Calculate GC content correction factor](#GNOMAD_GC_correction)
    6. [Downsampling gnomAD files](#GNOMAD_downsample)
    7. [gnomAD indels](#GNOMAD_indels)
        1. [Make database](#GNOMAD_indels_makedb)
        2. [Load database and count genome totals](#GNOMAD_indels_load)
7. [Prepare de novo SNV database](#denovo)
    1. [Studies aligned to hg19](#denovo_hg19)
    2. [Studies aligned to hg38](#denovo_hg38)
    3. [Combine all studies](#denovo_combine)
    4. [Load de novo SNV database](#denovo_load)
8. [Calculate mutation frequency surrounding motifs](#mutation_surrounding)
    1. [Define counting functions](#mutation_surrounding_functions)
    2. [Count and analyze flanking mutations and trinucleotides](#mutation_surrounding_analysis)
        1. [Mutation frequency surrounding random non-motif sequences](#mutation_surrounding_analysis_random)
        2. [Mutation frequency surrounding repeat motif sequences](#mutation_surrounding_analysis_repeat)
            1. [Load database and define motif categories](#mutation_surrounding_analysis_categories)
            2. [Mutation frequency surrounding STRs](#mutation_surrounding_analysis_str)
                1. [CG repeats in/outside of CpG islands](#mutation_surrounding_CGI)
            3. [Mutation frequency surrounding IRs](#mutation_surrounding_analysis_ir)
            4. [Mutation frequency surrounding MRs](#mutation_surrounding_analysis_mr)
            5. [Mutation frequency surrounding DRs](#mutation_surrounding_analysis_dr)
            6. [Mutation frequency surrounding Z-DNA](#mutation_surrounding_analysis_zdna)
            7. [Mutation frequency surrounding G4 motifs](#mutation_surrounding_analysis_g4)
            8. [Save/load mutation counts](#mutation_surrounding_analysis_saveload)
        3. [Count de novo mutations surrounding motifs](#mutation_surrounding_denovo)
    3. [Improper analysis of Non-B DB (on purpose)](#mutation_surrounding_nonbdb)
        1. [Count and analyze Non-B DB flanking mutations](#mutation_surrounding_nonbdb_count)
            1. [Mutation frequency surrounding STRs](#mutation_surrounding_nonbdb_str)
            2. [Mutation frequency surrounding IRs](#mutation_surrounding_nonbdb_ir)
            3. [Mutation frequency surrounding MRs](#mutation_surrounding_nonbdb_mr)
            4. [Mutation frequency surrounding DRs](#mutation_surrounding_nonbdb_dr)
            5. [Mutation frequency surrounding Z-DNA](#mutation_surrounding_nonbdb_zdna)
            6. [Mutation frequency surrounding G4 motifs](#mutation_surrounding_nonbdb_g4)
            7. [Save/load mutation counts](#mutation_surrounding_nonbdb_saveload)
    4. [Count flanking mutations using subsampled gnomAD](#mutation_surrounding_subsample)          
    5. [Plot flanking mutation frequencies](#mutation_surrounding_plot)
        1. [Plot individual motifs/categories](#mutation_surrounding_plot_ind)
        2. [Plot for Figure 2](#mutation_surrounding_plot_fig2)
        3. [Plot for Supplementary Figure S2a](#mutation_surrounding_plot_figS2A)
        4. [Plot for Supplementary Figure S2b](#mutation_surrounding_plot_figS2B)
        5. [Plot for Supplementary Figure S2c](#mutation_surrounding_plot_figS2B)
    6. [CG repeats in/outside of CpG islands](#mutation_surrounding_CGI)
        1. [Download CpG island map](#mutation_surrounding_CGI_download)
        2. [Measure distance between CG motifs and CpG islands](#mutation_surrounding_CGI_distance)
        3. [Analyze flanking mutations](#mutation_surrounding_CGI_analyze)
        4. [Plot for Supplementary Figure S2d](#mutation_surrounding_CGI_plot)
    7. [Flanking mutation frequencies after gnomAD subsampling](#mutation_surrounding_subsample)
        1. [Plot for Supplementary Figure S3](#mutation_surrounding_subsample_plot)
    8. [Flanking mutation frequencies by mutation type](#mutation_surrounding_bymut)
        1. [Plot for Supplementary Figure S4](#mutation_surrounding_bymut_plot)
9. [Indel analysis for flanking regions](#indel_analysis_flank)
    1. [Count indels flanking motifs](#indel_analysis_flank_count)
    2. [Save/load counts](#indel_analysis_flank_load)
    3. [Plot indels flanking motifs](#indel_analysis_flank_plot)
        1. [Plot for Supplementary Figure S2h](#indel_analysis_flank_plot_S2H)
10. [Calculate mutation frequencies within motifs](#mutation_internal)
    1. [Annotate positions within motifs](#mutation_internal_annotate)
        1. [Locate interruptions within motifs](#mutation_internal_annotate_interruptions)
        2. [Annotate other positions](#mutation_internal_annotate_other)
        3. [Combine all positions into database](#mutation_internal_annotate_combine)
    2. [Calculate mutation frequency at specified internal positions](#mutation_internal_count)
        1. [Define functions for analysis](#mutation_internal_count_functions)
        2. [Count and analyze](#mutation_internal_count_analyze)
            1. [Inverted/Mirror/Direct motifs](#mutation_internal_count_analyze_IRDMRDR)
            2. [Z-DNA motifs](#mutation_internal_count_analyze_ZDNA)
            3. [STR motifs](#mutation_internal_count_analyze_STR)
            4. [G4 motifs](#mutation_internal_count_analyze_G4)
    3. [Plot mutation frequency within motifs](#mutation_internal_count_plot)
        1. [Plot for Figure 3a](#mutation_internal_count_plot_3A)
        2. [Plot for Figure S3b](#mutation_internal_count_plot_S3B)
        3. [Plot for Supplementary Figure S5a](#mutation_internal_count_plot_S5A)
        4. [Plot for Supplementary Figure S5c](#mutation_internal_count_plot_S5C)
        5. [Plot for Supplementary Figures S4a, S6a](#mutation_internal_count_plot_S6A)
        6. [Plot for Figure 5a](#mutation_internal_count_plot_5A)
    4. [Indels within motifs](#mutation_internal_indels)
        1. [Define functions for analysis](#mutation_internal_indels_functions)
        2. [Counting and analysis of STR/DR/MR/IR/ZDNA](#mutation_internal_indels_analysis)
        3. [Counting and analysis of G4](#mutation_internal_indels_analysis_G4)
        4. [Save/load counts](#mutation_internal_indels_saveload)
        5. [Plots](#mutation_internal_indels_plots)
            1. [Plot for Figure 4a](#mutation_internal_indels_plots_fig4A)
            2. [Plot for Figure S4b](#mutation_internal_indels_plots_figS4B)
            3. [Plot for Figure S5b](#mutation_internal_indels_plots_figS5B)
            4. [Plot for Figure S6b](#mutation_internal_indels_plots_figS6B)
            5. [Plot for Figure S3a](#mutation_internal_indels_plots_figS3A)
            6. [Plots for Figure S5a, S5b](#mutation_internal_indels_plots_figS5)
            7. [Plot for Figure 3a](#mutation_internal_combined_plots_fig3A)
            8. [Plot for Figure S6c](#mutation_internal_combined_plots_figS6C)
        6. [STR insertion fidelity](#mutation_internal_STR_insertion_fidelity)
            1. [Calculation of error frequency](#mutation_internal_STR_insertion_fidelity_calculation)
            2. [Plot for Figure S3d](#mutation_internal_STR_insertion_fidelity_plots_figS3D)
        7. [Ratio of insertions to deletions](#mutation_internal_STR_insertion_deletion_ratio)
            1. [Plot for Figure S3c](#mutation_internal_STR_insertion_deletion_ratio_plots_figS3C)
        8. [Absolute frequency of indels and SNVs](#mutation_internal_STR_indel_SNV_rate)
            1. [Plot for Figure S3e](#mutation_internal_STR_indel_SNV_rate_plots_figS3E)
        9. [Direct repeat duplications](#mutation_internal_indel_DR_duplications)
            1. [Find and count duplications](#mutation_internal_indel_DR_duplications_count)
            2. [Plot for Figure S4d](#mutation_internal_indel_DR_duplications_plot_figS4D)
            3. [Duplication error frequency](#mutation_internal_indel_DR_duplications_errors)
            4. [Plot for Figure S4e](#mutation_internal_indel_DR_duplications_plot_figS4E)
            5. [Double-nucleotide variants, used in Fig. S4f](#mutation_internal_indel_DR_duplications_figS4F)

# Process sample dataset (chr22) or full dataset? <a name="sample"></a>
    Run only one of the two cells below: sample dataset (chr22) or full dataset (all autosomes)
[Return to Table of Contents](#TOC)
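
Whichever of the two cells is run, `chr_range` sets the first chromosome visited by later loops of the form `range(chr_range, 23)`, which several functions in this notebook use. The sketch below only illustrates that behavior; the helper name `chromosomes_to_process` is hypothetical and not part of the notebook.

```python
def chromosomes_to_process(chr_range):
    """Return the chromosome numbers visited by `range(chr_range, 23)`."""
    return list(range(chr_range, 23))

# Sample dataset: chr_range = 22 restricts the loops to chromosome 22
sample = chromosomes_to_process(22)

# Full dataset: chr_range = 1 covers autosomes 1 through 22
full = chromosomes_to_process(1)

print(sample)
print(len(full))
```

Note that a few functions loop over `range(1, 23)` directly and instead skip chromosomes that have no rows in the input.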

In [None]:
# Sample dataset
chr_list = [22]
chr_range = 22

In [None]:
# Full dataset (all autosomes)
chr_list = list(range(1,23))
chr_range = 1

# Libraries and functions used throughout the analysis
<a name="libraries"></a>

### Load Python libraries <a name="libraries1"></a>
- Comments next to the imports link to installation instructions (pip or conda) for libraries that are not bundled with conda

[Return to Table of Contents](#TOC)
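
Among the imports in the next cell, `binconf` is an alias for statsmodels' `proportion_confint`, which returns a confidence interval for a binomial proportion. As a rough illustration of what its default `method='normal'` computes, here is the textbook normal-approximation interval in plain Python. The function name `normal_approx_confint` and the example counts are hypothetical, and which method the later analysis passes to `binconf` is not shown in this section.

```python
from math import sqrt
from statistics import NormalDist

def normal_approx_confint(count, nobs, alpha=0.05):
    """Normal-approximation (Wald) interval for a binomial proportion."""
    p = count / nobs                                  # observed proportion
    z = NormalDist().inv_cdf(1 - alpha / 2)           # two-sided critical value
    half_width = z * sqrt(p * (1 - p) / nobs)         # z times the standard error
    return p - half_width, p + half_width

# Hypothetical example: 50 events observed in 1000 trials
low, high = normal_approx_confint(50, 1000)
print(round(low, 4), round(high, 4))
```

The Wald interval is a poor fit for very small counts, which is one reason `proportion_confint` also offers other methods such as `'wilson'`.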

In [None]:
from Bio import SeqIO     # https://biopython.org/wiki/SeqIO
from Bio.SeqIO.FastaIO import SimpleFastaParser
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

import pandas as pd
import numpy as np
import os
from time import time
import pickle
import statsmodels.api as statsmodels
binconf = statsmodels.stats.proportion_confint
from sty import fg, bg, rs, ef, Style, RgbFg      #pip install sty
import regex as re     # https://pypi.org/project/regex/
from pyliftover import LiftOver     # https://pypi.org/project/pyliftover/

import plotly     # https://plotly.com/python/getting-started/
plotly.offline.init_notebook_mode()
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.templates.default = "none"

In [None]:
# Tested with the following versions:

#pd.show_versions()

#python           : 3.8.12.final.0
#pandas           : 1.4.1
#numpy            : 1.20.3
#scipy            : 1.7.3

### Define functions which will be used throughout the analysis <a name="functions"></a>
[Return to Table of Contents](#TOC)
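
The `interval_overlap_per_chrom` function in the next cell merges overlapping windows by sorting starts and ends independently, then splitting wherever the next sorted start lies beyond the previous sorted end. The sketch below restates that idea for a single chromosome in plain Python, without pandas; the name `merge_intervals` and the toy coordinates are illustrative assumptions, not part of the notebook.

```python
def merge_intervals(starts, ends):
    """Merge overlapping windows using independently sorted endpoints."""
    starts = sorted(starts)
    ends = sorted(ends)
    merged_starts = [starts[0]]
    merged_ends = []
    # Compare each sorted start with the preceding sorted end
    for next_start, prev_end in zip(starts[1:], ends[:-1]):
        # A gap exists only where the next start lies beyond the previous end
        if next_start > prev_end:
            merged_ends.append(prev_end)
            merged_starts.append(next_start)
    merged_ends.append(ends[-1])
    return list(zip(merged_starts, merged_ends))

# Toy windows: (10, 20), (15, 30), (40, 50), (5, 12)
windows = merge_intervals([10, 15, 40, 5], [20, 30, 50, 12])
print(windows)
```

Sorting the two endpoint lists separately works because only the union of the windows matters here, not which start belonged to which end.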

In [None]:
# Flatten list
flatten = lambda l: [item for sublist in l for item in sublist]

# Merge any overlapping genomic windows
def interval_overlap_per_chrom(df_in, start_name = 'start', end_name = 'end'):
    df_out = dict()
    for chrom in range(1,23):
        df_current = df_in.loc[df_in['chrom'] == chrom]
        if len(df_current) > 0:
            range_df = pd.DataFrame()
            range_df[start_name] = sorted(list(df_current[start_name]))
            range_df[end_name] = sorted(list(df_current[end_name]))

            range_df_shift = pd.DataFrame()
            range_df_shift[start_name] = range_df[start_name][1:].reset_index(drop = True)
            range_df_shift[end_name] = range_df[end_name][:-1].reset_index(drop = True)
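            # Despite the column name, True marks a gap: the next sorted start lies beyond the previous sorted end
            range_df_shift['overlap'] = range_df_shift[start_name] > range_df_shift[end_name]
            range_df_shift = range_df_shift.loc[range_df_shift['overlap'] == True]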

            range_df_merged = pd.DataFrame()
            range_df_merged[start_name] = [range_df[start_name][0]] +list(range_df_shift[start_name])
            range_df_merged[end_name] = list(range_df_shift[end_name]) + [range_df[end_name][len(range_df.index)-1]]

            range_df = range_df_merged
            range_df['chrom'] = chrom
            df_out[chrom] = range_df[['chrom', start_name, end_name]]
    df_out = pd.concat(df_out).reset_index(drop = True)
    return df_out

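# Count trinucleotides for a list of genomic coordinates
# Windows are joined with 'NNN' so no trinucleotide spans two windows; N-containing triplets become NaN in .add() and are removed by dropna()
# Caution: a trinucleotide absent from any one of the three reading frames is dropped the same way, so very small inputs can lose entries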
def count_triplets_by_chrom(df_in):
    df_in['seq'] = [reference_genome[chrom][start-2:end+1] for chrom, start, end in zip(df_in['chrom'], df_in['start'], df_in['end'])]

    triplet_counts = pd.Series(0, index = all_triplets)
    for chrom in range(1,23):
        current_seq = 'NNN'.join(df_in.loc[df_in['chrom'] == chrom]['seq'])
        if len(current_seq) > 0:
            triplet_counts = triplet_counts.add(pd.Series(re.findall('...',current_seq)).value_counts()).add(pd.Series(re.findall('...',current_seq[1:])).value_counts()).add(pd.Series(re.findall('...',current_seq[2:])).value_counts()).dropna().astype(int)
    return triplet_counts

# Count 5mers for a list of genomic coordinates
def count_quints_by_chrom(df_in):
    df_in['seq'] = [reference_genome[chrom][start-3:end+2] for chrom, start, end in zip(df_in['chrom'], df_in['start'], df_in['end'])]

    # Index by all 5mers (not trinucleotides) so the totals align with the 5mer value_counts below
    quint_counts = pd.Series(0, index = all_quints)
    for chrom in range(chr_range,23):
        current_seq = 'NNNNN'.join(df_in.loc[df_in['chrom'] == chrom]['seq'])
        if len(current_seq) > 0:
            quint_counts = quint_counts.add(pd.Series(re.findall('.....',current_seq)).value_counts()).add(pd.Series(re.findall('.....',current_seq[1:])).value_counts()).add(pd.Series(re.findall('.....',current_seq[2:])).value_counts()).add(pd.Series(re.findall('.....',current_seq[3:])).value_counts()).add(pd.Series(re.findall('.....',current_seq[4:])).value_counts()).dropna().astype(int)
    return quint_counts

# Generate reverse-complementary sequence or mutation
def reverse_complement(dna):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N'}
    return ''.join([complement[base] for base in dna[::-1]]) 
def reverse_complement_mut(mut):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N'}
    return ''.join([complement[base] for base in mut.split('_')[0][::-1]])+'_'+complement[mut.split('_')[1]]

# Make lists of all trinucleotides and trinucleotide mutations
all_triplets = [base1+base2+base3 for base1 in ['A', 'T', 'G', 'C'] for base2 in ['A', 'T', 'G', 'C'] for base3 in ['A', 'T', 'G', 'C']]
triplet_mutations_und = [base1+base2+base3+'_'+mutbase for base1 in ['A', 'T', 'G', 'C'] for base2 in ['A', 'T', 'G', 'C'] for base3 in ['A', 'T', 'G', 'C'] for mutbase in ['A', 'T', 'G', 'C'] if mutbase != base2]

# Reverse-complementary trinucleotides
triplets_TC = [base1+base2+base3 for base1 in ['A', 'T', 'G', 'C'] for base2 in ['T', 'C'] for base3 in ['A', 'T', 'G', 'C']]
triplets_AG = [reverse_complement(mut) for mut in triplets_TC]
triplet_mutations_und_TC = [base1+base2+base3+'_'+mutbase for base1 in ['A', 'T', 'G', 'C'] for base2 in ['T', 'C'] for base3 in ['A', 'T', 'G', 'C'] for mutbase in ['A', 'T', 'G', 'C'] if mutbase != base2]
triplet_mutations_und_AG = [reverse_complement(mut[:3]) + '_' + reverse_complement(mut[4]) for mut in triplet_mutations_und_TC]

# Make lists of all 5mers and 5mer mutations
all_quints = [base1+base2+base3+base4+base5 for base1 in ['A', 'T', 'G', 'C'] for base2 in ['A', 'T', 'G', 'C'] for base3 in ['A', 'T', 'G', 'C'] for base4 in ['A', 'T', 'G', 'C'] for base5 in ['A', 'T', 'G', 'C']]
quint_mutations = [base1+base2+base3+base4+base5+'>'+mutbase for base1 in ['A', 'T', 'G', 'C'] for base2 in ['A', 'T', 'G', 'C'] for base3 in ['A', 'T', 'G', 'C'] for base4 in ['A', 'T', 'G', 'C'] for base5 in ['A', 'T', 'G', 'C'] for mutbase in ['A', 'T', 'G', 'C'] if mutbase != base3]
quint_mutations_und = [base1+base2+base3+base4+base5+'_'+mutbase for base1 in ['A', 'T', 'G', 'C'] for base2 in ['A', 'T', 'G', 'C'] for base3 in ['A', 'T', 'G', 'C'] for base4 in ['A', 'T', 'G', 'C'] for base5 in ['A', 'T', 'G', 'C'] for mutbase in ['A', 'T', 'G', 'C'] if mutbase != base3]

# Reverse-complementary 5mers
quints_TC = [base1+base2+base3+base4+base5 for base1 in ['A', 'T', 'G', 'C'] for base2 in ['A', 'T', 'G', 'C'] for base3 in ['T', 'C'] for base4 in ['A', 'T', 'G', 'C'] for base5 in ['A', 'T', 'G', 'C']]
quints_AG = [reverse_complement(mut) for mut in quints_TC]
quint_mutations_und_TC = [base1+base2+base3+base4+base5+'_'+mutbase for base1 in ['A', 'T', 'G', 'C'] for base2 in ['A', 'T', 'G', 'C'] for base3 in ['T', 'C'] for base4 in ['A', 'T', 'G', 'C']  for base5 in ['A', 'T', 'G', 'C'] for mutbase in ['A', 'T', 'G', 'C'] if mutbase != base3]
quint_mutations_und_AG = [reverse_complement(mut[:5]) + '_' + reverse_complement(mut[6]) for mut in quint_mutations_und_TC]

# Make lists of all trinucleotide indel mutations
triplet_mutations_und_indel = [tri + '_ins' for tri in all_triplets] + [tri + '_del' for tri in all_triplets]
triplet_mutations_und_indel_AG = [tri + '_ins' for tri in triplets_AG] + [tri + '_del' for tri in triplets_AG]
triplet_mutations_und_indel_TC = [tri + '_ins' for tri in triplets_TC] + [tri + '_del' for tri in triplets_TC]
triplet_mutations_und_ins = [tri + '_ins' for tri in all_triplets]
triplet_mutations_und_ins_AG = [tri + '_ins' for tri in triplets_AG]
triplet_mutations_und_ins_TC = [tri + '_ins' for tri in triplets_TC]
triplet_mutations_und_del = [tri + '_del' for tri in all_triplets]
triplet_mutations_und_del_AG = [tri + '_del' for tri in triplets_AG]
triplet_mutations_und_del_TC = [tri + '_del' for tri in triplets_TC]

def tri_function(chrom, pos, base = 1):     # base = 1 for 1-based positions (default), base = 0 for 0-based positions
    if base == 1:
        return reference_genome[chrom][pos-2: pos+1].upper()
    if base == 0:
        return reference_genome[chrom][pos-1: pos+2].upper()
def quint_function(chrom, pos): 
    return reference_genome[chrom][pos-3: pos+2].upper()
def reference_lookup(chrom, pos, distance):
    return reference_genome[chrom][pos-(distance+1): pos+distance].upper()

def triplet_combine_RC(input_series, mut_input = False, mut_output = False, decombine = False):
    if (mut_input == False) & (decombine == False):
        triplet_totals_RC = input_series.reindex(triplets_AG).fillna(0)
        triplet_totals_RC.index = triplets_TC
        triplet_totals_RC = triplet_totals_RC.add(input_series.reindex(triplets_TC), fill_value = 0)
        if mut_output == True:
            triplet_totals_RC_mut = triplet_totals_RC.reindex(flatten([[mut]*3 for mut in triplets_TC]))
            triplet_totals_RC_mut.index = triplet_mutations_und_TC
            return triplet_totals_RC_mut
        else:
            return triplet_totals_RC
    if (mut_input == False) & (decombine == True):
        triplet_totals_RC = input_series.reindex(triplets_TC).fillna(0)
        triplet_totals_RC.index = triplets_AG
        triplet_totals = pd.concat([input_series.reindex(triplets_TC), triplet_totals_RC], axis=0).reindex(all_triplets).fillna(0)
        if mut_output == True:
            triplet_totals_mut = triplet_totals.reindex(flatten([[mut]*3 for mut in all_triplets])).fillna(0)
            triplet_totals_mut.index = triplet_mutations_und
            return triplet_totals_mut
        else:
            return triplet_totals
    if (mut_input == True) & (decombine == False):
        triplet_totals_RC = input_series.reindex(triplet_mutations_und_AG).fillna(0)
        triplet_totals_RC.index = triplet_mutations_und_TC
        triplet_totals_RC = triplet_totals_RC.add(input_series.reindex(triplet_mutations_und_TC).fillna(0), fill_value = 0)
        return triplet_totals_RC
    if (mut_input == True) & (decombine == True):
        triplet_totals_RC = input_series.reindex(triplet_mutations_und_TC).fillna(0)
        triplet_totals_RC.index = triplet_mutations_und_AG
        triplet_totals = pd.concat([input_series.reindex(triplet_mutations_und_TC).fillna(0), triplet_totals_RC], axis=0).fillna(0)
        triplet_totals = triplet_totals.reindex(triplet_mutations_und).fillna(0)
        return triplet_totals

def triplet_combine_RC_indel(input_series, mut_input = False, mut_output = False, decombine = False):
    if (mut_input == False) & (decombine == False):
        triplet_totals_RC = input_series.reindex(triplets_AG).fillna(0)
        triplet_totals_RC.index = triplets_TC
        triplet_totals_RC = triplet_totals_RC.add(input_series.reindex(triplets_TC), fill_value = 0)
        if mut_output == True:
            triplet_totals_RC_mut = pd.concat([triplet_totals_RC,triplet_totals_RC])
            triplet_totals_RC_mut.index = triplet_mutations_und_indel_TC
            return triplet_totals_RC_mut
        else:
            return triplet_totals_RC
    if (mut_input == False) & (decombine == True):
        triplet_totals_RC = input_series.reindex(triplets_TC).fillna(0)
        triplet_totals_RC.index = triplets_AG
        triplet_totals = pd.concat([input_series.reindex(triplets_TC), triplet_totals_RC], axis=0).reindex(all_triplets).fillna(0)
        if mut_output == True:
            triplet_totals_mut = pd.concat([triplet_totals,triplet_totals])
            triplet_totals_mut.index = triplet_mutations_und_indel
            return triplet_totals_mut
        else:
            return triplet_totals
    if (mut_input == True) & (decombine == False):
        triplet_totals_RC = input_series.reindex(triplet_mutations_und_indel_AG).fillna(0)
        triplet_totals_RC.index = triplet_mutations_und_indel_TC
        # Plain assignment, as in triplet_combine_RC; '+=' would count the reverse-complement totals twice
        triplet_totals_RC = triplet_totals_RC.add(input_series.reindex(triplet_mutations_und_indel_TC).fillna(0), fill_value = 0)
        return triplet_totals_RC
    if (mut_input == True) & (decombine == True):
        triplet_totals_RC = input_series.reindex(triplet_mutations_und_indel_TC).fillna(0)
        triplet_totals_RC.index = triplet_mutations_und_indel_AG
        triplet_totals = pd.concat([input_series.reindex(triplet_mutations_und_indel_TC).fillna(0), triplet_totals_RC], axis=0)
        triplet_totals = triplet_totals.reindex(triplet_mutations_und_indel)
        return triplet_totals

# COSMIC mutation signature colors
colors = pd.DataFrame(index = ['C_A', 'C_G', 'C_T', 'T_A', 'T_C', 'T_G'])
colors['color'] = ['#2ACAFA', 'black', 'red', '#C0C0C0', '#92FA2A', 'pink']
colors['ind'] = [[base1+'C'+base3+'_'+'A' for base1 in ['A', 'C', 'G', 'T'] for base3 in ['A', 'C', 'G', 'T']], [base1+'C'+base3+'_'+'G' for base1 in ['A', 'C', 'G', 'T'] for base3 in ['A', 'C', 'G', 'T']], [base1+'C'+base3+'_'+'T' for base1 in ['A', 'C', 'G', 'T'] for base3 in ['A', 'C', 'G', 'T']], [base1+'T'+base3+'_'+'A' for base1 in ['A', 'C', 'G', 'T'] for base3 in ['A', 'C', 'G', 'T']], [base1+'T'+base3+'_'+'C' for base1 in ['A', 'C', 'G', 'T'] for base3 in ['A', 'C', 'G', 'T']], [base1+'T'+base3+'_'+'G' for base1 in ['A', 'C', 'G', 'T'] for base3 in ['A', 'C', 'G', 'T']]]
colors['ind_quint'] = [[base1+base2+'C'+base4+base5+'_'+'A' for base1 in ['A', 'C', 'G', 'T'] for base2 in ['A', 'C', 'G', 'T'] for base4 in ['A', 'C', 'G', 'T'] for base5 in ['A', 'C', 'G', 'T']], [base1+base2+'C'+base4+base5+'_'+'G' for base1 in ['A', 'C', 'G', 'T'] for base2 in ['A', 'C', 'G', 'T'] for base4 in ['A', 'C', 'G', 'T'] for base5 in ['A', 'C', 'G', 'T']], [base1+base2+'C'+base4+base5+'_'+'T' for base1 in ['A', 'C', 'G', 'T'] for base2 in ['A', 'C', 'G', 'T'] for base4 in ['A', 'C', 'G', 'T'] for base5 in ['A', 'C', 'G', 'T']], [base1+base2+'T'+base4+base5+'_'+'A' for base1 in ['A', 'C', 'G', 'T'] for base2 in ['A', 'C', 'G', 'T'] for base4 in ['A', 'C', 'G', 'T'] for base5 in ['A', 'C', 'G', 'T']], [base1+base2+'T'+base4+base5+'_'+'C' for base1 in ['A', 'C', 'G', 'T'] for base2 in ['A', 'C', 'G', 'T'] for base4 in ['A', 'C', 'G', 'T'] for base5 in ['A', 'C', 'G', 'T']], [base1+base2+'T'+base4+base5+'_'+'G' for base1 in ['A', 'C', 'G', 'T'] for base2 in ['A', 'C', 'G', 'T'] for base4 in ['A', 'C', 'G', 'T'] for base5 in ['A', 'C', 'G', 'T']]]
colors['ind_F'] = [mut for mut in triplet_mutations_und_TC if (mut[1] == 'C') & (mut[4] == 'A')], [mut for mut in triplet_mutations_und_TC if (mut[1] == 'C') & (mut[4] == 'G')], [mut for mut in triplet_mutations_und_TC if (mut[1] == 'C') & (mut[4] == 'T')], [mut for mut in triplet_mutations_und_TC if (mut[1] == 'T') & (mut[4] == 'A')], [mut for mut in triplet_mutations_und_TC if (mut[1] == 'T') & (mut[4] == 'C')], [mut for mut in triplet_mutations_und_TC if (mut[1] == 'T') & (mut[4] == 'G')]
colors['ind_RC'] = [[reverse_complement_mut(mut) for mut in triplets] for triplets in colors['ind_F']]
colors['ind_all'] = colors['ind_F'] + colors['ind_RC']
#colors['ind'] = [[base1+'CA'+'_'+'A' for base1 in ['A', 'T', 'G', 'C']] + [base1+'CT'+'_'+'A' for base1 in ['A', 'T', 'G', 'C']] + [base1+'CC'+'_'+'A' for base1 in ['A', 'T', 'G', 'C']] + [base1+'CG'+'_'+'A' for base1 in ['A', 'T', 'G', 'C']], [base1+'CA'+'_'+'G' for base1 in ['A', 'T', 'G', 'C']] + [base1+'CT'+'_'+'G' for base1 in ['A', 'T', 'G', 'C']] + [base1+'CC'+'_'+'G' for base1 in ['A', 'T', 'G', 'C']] + [base1+'CG'+'_'+'G' for base1 in ['A', 'T', 'G', 'C']], [base1+'CA'+'_'+'T' for base1 in ['A', 'T', 'G', 'C']] + [base1+'CT'+'_'+'T' for base1 in ['A', 'T', 'G', 'C']] + [base1+'CC'+'_'+'T' for base1 in ['A', 'T', 'G', 'C']] + [base1+'CG'+'_'+'T' for base1 in ['A', 'T', 'G', 'C']], [base1+'TA'+'_'+'A' for base1 in ['A', 'T', 'G', 'C']] + [base1+'TT'+'_'+'A' for base1 in ['A', 'T', 'G', 'C']] + [base1+'TC'+'_'+'A' for base1 in ['A', 'T', 'G', 'C']] + [base1+'TG'+'_'+'A' for base1 in ['A', 'T', 'G', 'C']], [base1+'TA'+'_'+'C' for base1 in ['A', 'T', 'G', 'C']] + [base1+'TT'+'_'+'C' for base1 in ['A', 'T', 'G', 'C']] + [base1+'TC'+'_'+'C' for base1 in ['A', 'T', 'G', 'C']] + [base1+'TG'+'_'+'C' for base1 in ['A', 'T', 'G', 'C']], [base1+'TA'+'_'+'G' for base1 in ['A', 'T', 'G', 'C']] + [base1+'TT'+'_'+'G' for base1 in ['A', 'T', 'G', 'C']] + [base1+'TC'+'_'+'G' for base1 in ['A', 'T', 'G', 'C']] + [base1+'TG'+'_'+'G' for base1 in ['A', 'T', 'G', 'C']]]
colors['col_num'] = [1,2,3,4,5,6]

triplets_by_mut = colors['ind'].copy()
triplets_by_mut['CpG_T'] = [tri for tri in triplets_by_mut['C_T'] if tri[2] == 'G']
triplets_by_mut['C_T'] = [tri for tri in triplets_by_mut['C_T'] if tri[2] != 'G']

# Use a set of coordinates to mask a fasta sequence and output to new fasta file
def coordinates_to_Ns(coordinates, reference, description_str, name):
    coordinates = interval_overlap_per_chrom(coordinates)
    for chrom in range(chr_range,23):
        chr_seq = reference[chrom]
        chr_coordinates = coordinates.loc[coordinates['chrom'] == chrom]
        
        mask_ranges = []
        for coord in list(zip(chr_coordinates['start'], chr_coordinates['end'])):
            mask_ranges.append(list(range(coord[0], coord[1])))
        mask_ranges = flatten(mask_ranges)
        mask_ranges = list(set(mask_ranges))

        ### replace list of positions with Ns
        chr_seq = list(chr_seq)
        for pos in mask_ranges:
            chr_seq[pos] = 'N'
        chr_seq = ''.join(chr_seq)

        record = SeqRecord(Seq(chr_seq), id = 'DNA', description = description_str)
        with open('./hg38/chromosomes/masked/chr'+str(chrom)+'_mask_'+name+'.fasta', 'w') as output_handle:
            SeqIO.write(record, output_handle, "fasta")
        print('\r' + str(chrom), end=' ')

# Calculate distance between coordinates within the same dataframe
def distance_within_df(df, name = '', start_name = 'start', end_name = 'end'):
    if name != '':
        name = name + '_'
    df_distance = dict()
    df_internals = dict()
    df_duplicated = dict()
    for chrom in range(chr_range,23):
        current_chrom = df.loc[df['chrom'] == chrom].copy()
        current_chrom = current_chrom.sort_values(by = [start_name, end_name])
        current_chrom['end_max'] = current_chrom[end_name].cummax()
        current_chrom['end_max-end'] = current_chrom['end_max'] - current_chrom[end_name]
        # set aside coordinates fully contained inside other coordinates
        # internals will have NaN for distance
        df_internals[chrom] = current_chrom.loc[current_chrom['end_max-end'] >0].copy()
        df_internals[chrom][name+'distance_left'] = np.nan; df_internals[chrom][name+'distance_right'] = np.nan
        current_chrom = current_chrom.loc[current_chrom['end_max-end'] == 0]
        # remove exact duplicates (keeping one set, setting aside the rest)
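        # copy() so the column assignments below modify an independent frame rather than a slice of current_chrom
        df_duplicated[chrom] = current_chrom.loc[current_chrom.duplicated(subset = [start_name, end_name])].copy()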
        current_chrom = current_chrom.drop_duplicates(subset = [start_name, end_name])
        # calculate L and R distance
        # all coordinates should be ordered by both start and end, so distance <=0 indicates overlap
        # start and end of chromosome will be NaN
        current_chrom[name+'distance_left'] = current_chrom[start_name] - current_chrom[end_name].shift(1)
        current_chrom[name+'distance_right'] = current_chrom[start_name].shift(-1) - current_chrom[end_name]
        df_distance[chrom] = current_chrom
        df_distance[chrom].index = [str(chrom)+'_'+str(start)+'_'+str(end) for start, end in zip(df_distance[chrom][start_name], df_distance[chrom][end_name])]
        # copy distances to duplicated coordinates (they will show distances to nearest non-identical coordinates)
        df_duplicated[chrom]['coordinates'] = [str(chrom)+'_'+str(start)+'_'+str(end) for start, end in zip(df_duplicated[chrom][start_name], df_duplicated[chrom][end_name])]
        df_duplicated[chrom][name+'distance_left'] = list(df_distance[chrom].reindex(df_duplicated[chrom]['coordinates'])[name+'distance_left'])
        df_duplicated[chrom][name+'distance_right'] = list(df_distance[chrom].reindex(df_duplicated[chrom]['coordinates'])[name+'distance_right'])

    df_distance = pd.concat(df_distance)
    df_duplicated = pd.concat(df_duplicated)
    df_internals = pd.concat(df_internals)
    
    df_distance[name+'distance_left'] = df_distance[name+'distance_left'].fillna(np.inf)
    df_distance[name+'distance_right'] = df_distance[name+'distance_right'].fillna(np.inf)
    df_duplicated[name+'distance_left'] = df_duplicated[name+'distance_left'].fillna(np.inf)
    df_duplicated[name+'distance_right'] = df_duplicated[name+'distance_right'].fillna(np.inf)

    df = pd.concat([df_distance, df_duplicated, df_internals])
    df = df.sort_values(by = ['chrom', start_name, end_name]).reset_index(drop = True)
    del df['coordinates']; del df['end_max']; del df['end_max-end']
    df[name+'distance_min'] = [min([L,R]) for L, R in zip(df[name+'distance_left'], df[name+'distance_right'])]
    # if either side is zero or negative (touching or overlapping), set both values to 0
    df[name+'distance_left'] = [dist if dist_min > 0 else 0 for dist, dist_min in zip(df[name+'distance_left'], df[name+'distance_min'])]
    df[name+'distance_right'] = [dist if dist_min > 0 else 0 for dist, dist_min in zip(df[name+'distance_right'], df[name+'distance_min'])]
    
    return df

# Calculate distance from all coordinates in first dataframe to nearest coordinates in second dataframe
def measure_distance(df_measure, df_compare, comp_name, start_name = 'start', end_name = 'end'):
    df_measure = df_measure.sort_values(by = ['chrom', start_name, end_name])
    df_compare_merged = interval_overlap_per_chrom(df_compare)
    df_compare_merged.columns = ['chrom', start_name, end_name]
    df_measure_comparedist = dict()
    for chrom in range(chr_range,23):
        current_measure = df_measure.loc[df_measure['chrom'] == chrom].copy()
        current_compare = df_compare_merged.loc[df_compare_merged['chrom'] == chrom].copy()
        current_measure[comp_name+'_index'] = np.searchsorted(current_compare[start_name], current_measure[start_name], side = 'left')
        current_measure['nearestcompare_left_end'] = list(current_compare.reset_index().reindex(current_measure[comp_name+'_index']-1)[end_name])
        current_measure['nearestcompare_right_start'] = list(current_compare.reset_index().reindex(current_measure[comp_name+'_index'])[start_name])
        current_measure[comp_name+'_distance_left'] = (current_measure[start_name] - current_measure['nearestcompare_left_end']).fillna(np.inf)
        current_measure[comp_name+'_distance_right'] = (current_measure['nearestcompare_right_start'] - current_measure[end_name]).fillna(np.inf)
        df_measure_comparedist[chrom] = current_measure
    df_measure_comparedist = pd.concat(df_measure_comparedist).sort_values(by = ['chrom', start_name, end_name]).reset_index(drop = True)
    del df_measure_comparedist[comp_name+'_index']; del df_measure_comparedist['nearestcompare_left_end']; del df_measure_comparedist['nearestcompare_right_start']
    df_measure_comparedist[comp_name+'_distance_min'] = [min([L,R]) for L, R in zip(df_measure_comparedist[comp_name+'_distance_left'], df_measure_comparedist[comp_name+'_distance_right'])]
    # if either side is negative, set both values to 0
    df_measure_comparedist[comp_name+'_distance_left'] = [dist if dist_min > 0 else 0 for dist, dist_min in zip(df_measure_comparedist[comp_name+'_distance_left'], df_measure_comparedist[comp_name+'_distance_min'])]
    df_measure_comparedist[comp_name+'_distance_right'] = [dist if dist_min > 0 else 0 for dist, dist_min in zip(df_measure_comparedist[comp_name+'_distance_right'], df_measure_comparedist[comp_name+'_distance_min'])]
    
    return df_measure_comparedist

# Calculate distance from all point coordinates in first dataframe to nearest start/end coordinates in second dataframe
def measure_distance_point(df_measure, df_compare, pos_column_name, comp_name, start_name = 'start', end_name = 'end'):
    df_measure = df_measure.sort_values(by = ['chrom', pos_column_name])
    df_compare_merged = interval_overlap_per_chrom(df_compare)
    df_measure_comparedist = dict()
    for chrom in range(chr_range,23):
        current_compare = df_compare_merged.loc[df_compare_merged['chrom'] == chrom].copy()
        current_compare_combined = pd.DataFrame(pd.concat([current_compare[start_name], current_compare[end_name]]).sort_values(), columns = ['pos']).reset_index(drop = True)
        current_compare_combined['se'] = ['s', 'e']*round(len(current_compare_combined)/2)

        current_measure = df_measure.loc[df_measure['chrom'] == chrom].reset_index(drop = True).copy()
        current_measure[['current_left_pos', 'current_left_se']] = current_compare_combined.reindex(np.searchsorted(current_compare_combined['pos'], current_measure[pos_column_name])-1).reset_index(drop = True)
        current_measure[['current_right_pos', 'current_right_se']] = current_compare_combined.reindex(np.searchsorted(current_compare_combined['pos'], current_measure[pos_column_name])).reset_index(drop = True)
        current_measure['current_se'] = current_measure['current_left_se'] + current_measure['current_right_se']
        # wrap each list in a Series so that .fillna() is available (a plain list has no .fillna method)
        current_measure[comp_name + '_distance_left'] = pd.Series([pos - left if se == 'es' else 0 for pos, left, se in zip(current_measure[pos_column_name], current_measure['current_left_pos'], current_measure['current_se'])], index = current_measure.index).fillna(np.inf)
        current_measure[comp_name + '_distance_right'] = pd.Series([right - pos if se == 'es' else 0 for pos, right, se in zip(current_measure[pos_column_name], current_measure['current_right_pos'], current_measure['current_se'])], index = current_measure.index).fillna(np.inf)
        
        df_measure_comparedist[chrom] = current_measure
    df_measure_comparedist = pd.concat(df_measure_comparedist).sort_values(by = ['chrom', pos_column_name]).reset_index(drop = True)
    del df_measure_comparedist['current_left_pos']; del df_measure_comparedist['current_left_se']; del df_measure_comparedist['current_right_pos']; del df_measure_comparedist['current_right_se']; del df_measure_comparedist['current_se']
    df_measure_comparedist[comp_name+'_distance_min'] = [min([L,R]) for L, R in zip(df_measure_comparedist[comp_name+'_distance_left'], df_measure_comparedist[comp_name+'_distance_right'])]
    # if either side is negative, set both values to 0
    df_measure_comparedist[comp_name+'_distance_left'] = [dist if dist_min > 0 else 0 for dist, dist_min in zip(df_measure_comparedist[comp_name+'_distance_left'], df_measure_comparedist[comp_name+'_distance_min'])]
    df_measure_comparedist[comp_name+'_distance_right'] = [dist if dist_min > 0 else 0 for dist, dist_min in zip(df_measure_comparedist[comp_name+'_distance_right'], df_measure_comparedist[comp_name+'_distance_min'])]
    return df_measure_comparedist

def make_colorscale(list_of_traces, opacity = 0.15):
    current_colors = pd.Series(['rgb('+str(current/len(list_of_traces)*255) + ', 180, '+str((len(list_of_traces)-current)/len(list_of_traces)*255)+')' for current in range(len(list_of_traces))], index = list_of_traces)
    return pd.Series(current_colors, index = list_of_traces), pd.Series(['rgba' + color[3:-1] + ', '+str(opacity)+')' for color in current_colors], index = list_of_traces)

def make_default_colors(list_of_traces, opacity = 0.15, last_black = True):
    current_colors = plotly.colors.DEFAULT_PLOTLY_COLORS + plotly.colors.qualitative.Dark2
    current_colors = current_colors*max(round(len(list_of_traces)/len(current_colors))+1, 1)
    if last_black == True:
        current_colors = current_colors[:len(list_of_traces)-1] + ['rgb(0,0,0)']
    else:
        current_colors = current_colors[:len(list_of_traces)]
    return pd.Series(current_colors, index = list_of_traces), pd.Series(['rgba' + color[3:-1] + ', ' + str(opacity) + ')' for color in current_colors], index = list_of_traces)

def reverse_complement_dnv(mut):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N'}
    return ''.join([complement[base] for base in mut.split('>')[0][::-1]])+'>'+''.join([complement[base] for base in mut.split('>')[1][::-1]])
dinuc = [base1+base2 for base1 in ['A', 'T', 'G', 'C'] for base2 in ['A', 'T', 'G', 'C']]
dinuc_F = ['AA', 'AG', 'AC', 'TG', 'TC', 'GG']
dinuc_RC = ['TT', 'CT', 'GT', 'CA', 'GA', 'CC']
dinuc_sym = ['AT', 'TA', 'GC', 'CG']
dinuc_mut_F = [di1 + '>' + di2 for di1 in dinuc_F for di2 in dinuc if (di1[0] != di2[0]) & (di1[1] != di2[1])]
dinuc_mut_RC = [reverse_complement_dnv(mut) for mut in dinuc_mut_F]
dinuc_mut_sym = [di1 + '>' + di2 for di1 in dinuc_sym for di2 in dinuc if (di1[0] != di2[0]) & (di1[1] != di2[1])]
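
The dinucleotide classes defined above can be sanity-checked in isolation. The sketch below rebuilds them on their own and confirms that the forward, reverse-complement, and self-complementary sets partition all 16 dinucleotides, and that each starting dinucleotide has 9 double-nucleotide substitutions. The helper names `revcomp` and `dnv_from` are assumptions introduced for illustration and are not part of the analysis code.

```python
complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N'}

def revcomp(seq):
    # hypothetical helper: reverse complement of a short sequence
    return ''.join(complement[base] for base in reversed(seq))

def reverse_complement_dnv(mut):
    # 'AA>CC' -> 'TT>GG': reverse-complement both sides of the substitution
    ref, alt = mut.split('>')
    return revcomp(ref) + '>' + revcomp(alt)

bases = ['A', 'T', 'G', 'C']
dinuc = [b1 + b2 for b1 in bases for b2 in bases]
dinuc_F = ['AA', 'AG', 'AC', 'TG', 'TC', 'GG']
dinuc_RC = [revcomp(di) for di in dinuc_F]
dinuc_sym = [di for di in dinuc if revcomp(di) == di]

def dnv_from(sources):
    # hypothetical helper: substitutions in which both positions change
    return [d1 + '>' + d2 for d1 in sources for d2 in dinuc
            if d1[0] != d2[0] and d1[1] != d2[1]]

dinuc_mut_F = dnv_from(dinuc_F)
dinuc_mut_RC = [reverse_complement_dnv(mut) for mut in dinuc_mut_F]
dinuc_mut_sym = dnv_from(dinuc_sym)
```

Together the three mutation lists hold 54 + 54 + 36 = 144 entries, which equals 16 starting dinucleotides times 9 possible double substitutions.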

# Process hg38 human genome fasta  <a name="hg38"></a>
- Measure trinucleotide content
- Apply RepeatMasker to remove transposable elements

#### Download genome fasta files  <a name="hg38download"></a>
- hg38 chromosome fasta files available at: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/
- Extract .fa.gz
- Place .fa files in subfolder "./hg38/chromosomes/"
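
The manual steps above could also be scripted. The following is a minimal sketch, assuming the UCSC directory serves one file per chromosome named `chr<N>.fa.gz`; the helper names `chrom_url`, `chrom_path`, and `download_chrom` are hypothetical and not part of the notebook.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

UCSC_BASE = 'https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/'

def chrom_url(chrom):
    # assumed file naming on the UCSC server: chr1.fa.gz, chr2.fa.gz, ...
    return UCSC_BASE + 'chr' + str(chrom) + '.fa.gz'

def chrom_path(chrom, out_dir='./hg38/chromosomes/'):
    # location where the notebook expects the extracted fasta file
    return Path(out_dir) / ('chr' + str(chrom) + '.fa')

def download_chrom(chrom, out_dir='./hg38/chromosomes/'):
    # stream the gzip archive and write the decompressed fasta to disk
    target = chrom_path(chrom, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(chrom_url(chrom)) as response:
        with gzip.open(response, 'rt') as unzipped, open(target, 'w') as out:
            shutil.copyfileobj(unzipped, out)
    return target
```

Calling `download_chrom(chrom)` for each autosome of interest would populate the subfolder; nothing is downloaded until the function is called.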

#### Read hg38 fasta files  <a name="hg38read"></a>

[Return to Table of Contents](#TOC)

In [None]:
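reference_genome = dict()
for chrom in chr_list:
    with open('./hg38/chromosomes/chr'+str(chrom)+'.fa') as fasta_file:
        # each record is a (title, sequence) tuple; the last record in the file is kept
        for sequence in SimpleFastaParser(fasta_file):
            chr_seq = sequence
        # convert to upper case, since UCSC fasta files mark repeats in lower case
        reference_genome[chrom] = chr_seq[1]
        reference_genome[chrom] = reference_genome[chrom].upper()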

#### Count trinucleotides per chromosome  <a name="hg38count"></a>
[Return to Table of Contents](#TOC)
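
The counting cell below tiles each chromosome with non-overlapping 3-mers in each of the three reading frames, which together cover every overlapping trinucleotide exactly once. The sketch below illustrates that equivalence on a toy sequence; the function names and the toy string are assumptions made for illustration only.

```python
import re
from collections import Counter

def count_kmers_by_frame(seq, k):
    # tile the sequence with non-overlapping k-mers, once per reading frame
    counts = Counter()
    for offset in range(k):
        counts.update(re.findall('.' * k, seq[offset:]))
    return counts

def count_kmers_sliding(seq, k):
    # reference implementation: slide a window of width k one base at a time
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

toy = 'ACGTACGTTN'
by_frame = count_kmers_by_frame(toy, 3)
sliding = count_kmers_sliding(toy, 3)
```

Both counters include windows that contain `N`. The notebook's version presumably discards those because they are absent from the `all_triplets` index and are removed by `dropna()`.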

In [None]:
# Count all trinucleotides in hg38
triplet_df = pd.Series(0, index = all_triplets)

for chrom in range(chr_range,23):
    # tile the chromosome in each of the three reading frames so every overlapping trinucleotide is counted once
    for offset in range(3):
        triplet_df = triplet_df.add(pd.Series(re.findall('...',reference_genome[chrom][offset:])).value_counts())
    triplet_df = triplet_df.dropna().astype(int)
    print('\r' + str(chrom), end=' ')
genome_triplet_totals = pd.DataFrame()
genome_triplet_totals['hg38'] = triplet_df

In [None]:
# Save triplet totals in .csv format
genome_triplet_totals.to_csv('./hg38/triplet_totals_hg38_chr'+str(chr_range)+'-22.csv')

In [None]:
# Load triplet totals
genome_triplet_totals = pd.read_csv('./hg38/triplet_totals_hg38_chr'+str(chr_range)+'-22.csv', index_col = 0)

In [None]:
# Count all dinucleotides in hg38
di_df = pd.Series(0, index = dinuc)

for chrom in range(chr_range,23):
    # tile the chromosome in both reading frames so every overlapping dinucleotide is counted once
    for offset in range(2):
        di_df = di_df.add(pd.Series(re.findall('..',reference_genome[chrom][offset:])).value_counts())
    di_df = di_df.dropna().astype(int)
    print('\r' + str(chrom), end=' ')
genome_di_totals = pd.DataFrame()
genome_di_totals['hg38'] = di_df

# Save dinuc totals in .csv format
genome_di_totals.to_csv('./hg38/dinuc_totals_hg38_chr'+str(chr_range)+'-22.csv')

In [None]:
# Load dinuc totals
genome_di_totals = pd.read_csv('./hg38/dinuc_totals_hg38_chr'+str(chr_range)+'-22.csv', index_col = 0)

# Use Repeatmasker to remove transposable elements from hg38 <a name="RM"></a>

#### Download Repeatmasker <a name="RMdownload"></a>
- Download the RepeatMasker track from the UCSC Table Browser: https://genome.ucsc.edu/cgi-bin/hgTables
- Select options:
    - assembly: Dec 2013 (GRCh38/hg38)
    - group: Repeats
    - track: RepeatMasker
    - region: genome
    - output format: all fields from selected table
    - output filename: hg38_repeatmasker.bed
    - file type returned: gzip compressed
- Place .bed.gz file in subfolder './hg38/repeatmasker/'

#### Read and format Repeatmasker file <a name="RMread"></a>

[Return to Table of Contents](#TOC)

In [None]:
repeatmasker = pd.read_csv('./hg38/repeatmasker/hg38_repeatmasker.bed.gz', sep = '\t', usecols = ['genoName', 'genoStart', 'genoEnd', 'strand', 'repName', 'repClass', 'repFamily'])
repeatmasker.columns = ['chrom', 'start', 'end', 'strand', 'repName', 'repClass', 'repFamily']
repeatmasker['chrom'] = [chrom[3:] for chrom in repeatmasker['chrom']]
repeatmasker = repeatmasker.loc[repeatmasker['chrom'].isin([str(chrom) for chrom in range(chr_range,23)])]
repeatmasker['chrom'] = repeatmasker['chrom'].astype(int)
repeatmasker = repeatmasker.reset_index(drop = True)

# Remove 'Low complexity' and 'Simple Repeat' categories from Repeatmasker
repeatmasker = repeatmasker.loc[repeatmasker['repClass'].isin(['Low_complexity', 'Simple_repeat']) == False]

In [None]:
# Reduce Repeatmasker to a set of non-overlapping coordinates
RM_LC_SR = interval_overlap_per_chrom(repeatmasker)
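
`interval_overlap_per_chrom` is defined elsewhere in the notebook and is not reproduced here. As an illustration of what reducing to non-overlapping coordinates involves, the following is a minimal sketch of interval merging with pandas. The function `merge_intervals` is a hypothetical stand-in, and it assumes half-open coordinates, so touching intervals are merged as well.

```python
import pandas as pd

def merge_intervals(df):
    # hypothetical stand-in for the notebook's interval_overlap_per_chrom
    merged = []
    ordered = df.sort_values(['chrom', 'start', 'end'])
    for chrom, group in ordered.groupby('chrom'):
        # furthest end seen among all earlier intervals on this chromosome
        prev_max_end = group['end'].cummax().shift(1)
        # a new block begins where an interval starts beyond that furthest end
        block_id = (group['start'] > prev_max_end).cumsum()
        block = group.groupby(block_id).agg(start=('start', 'min'), end=('end', 'max'))
        block.insert(0, 'chrom', chrom)
        merged.append(block)
    return pd.concat(merged, ignore_index=True)

toy = pd.DataFrame({'chrom': [1, 1, 1, 2],
                    'start': [0, 5, 20, 3],
                    'end': [10, 15, 30, 8]})
merged = merge_intervals(toy)
```

In the toy table, the first two intervals on chromosome 1 overlap and collapse into a single interval from 0 to 15, while the remaining intervals are left unchanged.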

#### Modify genome fastas to remove Repeatmasker regions <a name="RMmodify"></a>
- Create the directory './hg38/chromosomes/masked/' for file output

[Return to Table of Contents](#TOC)

In [None]:
# Generate masked reference genome
coordinates_to_Ns(RM_LC_SR, reference_genome, 'transposons replaced with Ns', 'RMnosimple')
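
For orientation, the masking step amounts to overwriting every base inside a masked interval with `N` while keeping the sequence length unchanged. The sketch below shows this on a toy string, assuming 0-based half-open intervals; `mask_with_ns` and `ensure_output_dir` are hypothetical helpers, not the notebook's `coordinates_to_Ns`.

```python
import os

def ensure_output_dir(path='./hg38/chromosomes/masked/'):
    # create the output folder if it does not exist yet
    os.makedirs(path, exist_ok=True)

def mask_with_ns(seq, intervals):
    # replace every base inside each [start, end) interval with 'N'
    chars = list(seq)
    for start, end in intervals:
        end = min(end, len(chars))  # never extend past the end of the sequence
        if start < end:
            chars[start:end] = 'N' * (end - start)
    return ''.join(chars)

masked_toy = mask_with_ns('ACGTACGT', [(2, 5)])
```

Because positions are preserved, coordinates found in the masked sequence remain valid in the original reference.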

In [None]:
# Load masked reference genome
reference_genome_masked = dict()
for chrom in chr_list:
    with open('./hg38/chromosomes/masked/chr'+str(chrom)+'_mask_RMnosimple.fasta') as fasta_file:
        for sequence in SimpleFastaParser(fasta_file):
            chr_seq = sequence
        reference_genome_masked[chrom] = chr_seq[1]
        reference_genome_masked[chrom] = reference_genome_masked[chrom].upper()

# Generate database of repeats in hg38 <a name="DB"></a>

## Short Tandem Repeats <a name="DB_STR"></a>

### Define search features <a name="DB_STR_define"></a>

#### List of all possible N-mers <a name="DB_STR_define_allNmers"></a>

[Return to Table of Contents](#TOC)
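
The cell below enumerates every N-mer and excludes those that are themselves repeats of a shorter unit, such as `ATAT`, which is `AT` repeated twice. The same lists can be produced generically, as in the sketch below. This is an illustrative alternative under the assumption that the exclusion lists are meant to remove exactly these periodic units; `is_primitive` and `primitive_nmers` are hypothetical names.

```python
from itertools import product

BASES = 'ATGC'

def is_primitive(unit):
    # True unless the unit is a shorter unit repeated a whole number of times
    n = len(unit)
    return not any(n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n))

def primitive_nmers(n):
    # all N-mers over A/T/G/C that are not repeats of a shorter unit
    return [''.join(p) for p in product(BASES, repeat=n) if is_primitive(''.join(p))]

counts = [len(primitive_nmers(n)) for n in range(1, 7)]
```

`itertools.product` iterates in the same order as the nested comprehensions, with the first base varying slowest, so the resulting lists are ordered identically.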

In [None]:
bases = ['A', 'T', 'G', 'C']
dinucleotides = [base1+base2 for base1 in ['A', 'T', 'G', 'C'] for base2 in ['A', 'T', 'G', 'C'] if base1 != base2]
trinucleotides = [base1+base2+base3 for base1 in ['A', 'T', 'G', 'C'] for base2 in ['A', 'T', 'G', 'C'] for base3 in ['A', 'T', 'G', 'C'] if base1+base2+base3 not in ['AAA', 'TTT', 'GGG', 'CCC']]

exclude_4 = [base*4 for base in bases] + [di*2 for di in dinucleotides]
tetranucleotides = [base1+base2+base3+base4 for base1 in ['A', 'T', 'G', 'C'] for base2 in ['A', 'T', 'G', 'C'] for base3 in ['A', 'T', 'G', 'C'] for base4 in ['A', 'T', 'G', 'C'] if base1+base2+base3+base4 not in exclude_4]

exclude_5 = [base*5 for base in bases]
exclude_6 = [base*6 for base in bases] + [di*3 for di in dinucleotides] + [tri*2 for tri in trinucleotides]
exclude_7 = [base*7 for base in bases]
exclude_8 = [base*8 for base in bases] + [di*4 for di in dinucleotides] + [quad*2 for quad in tetranucleotides]
exclude_9 = [base*9 for base in bases] + [tri*3 for tri in trinucleotides]

nuc_5mer = [base1+base2+base3+base4+base5 for base1 in ['A', 'T', 'G', 'C'] for base2 in ['A', 'T', 'G', 'C'] for base3 in ['A', 'T', 'G', 'C'] for base4 in ['A', 'T', 'G', 'C'] for base5 in ['A', 'T', 'G', 'C'] if base1+base2+base3+base4+base5 not in exclude_5]
nuc_6mer = [base1+base2+base3+base4+base5+base6 for base1 in ['A', 'T', 'G', 'C'] for base2 in ['A', 'T', 'G', 'C'] for base3 in ['A', 'T', 'G', 'C'] for base4 in ['A', 'T', 'G', 'C'] for base5 in ['A', 'T', 'G', 'C'] for base6 in ['A', 'T', 'G', 'C'] if base1+base2+base3+base4+base5+base6 not in exclude_6]
nuc_7mer = [base1+base2+base3+base4+base5+base6+base7 for base1 in ['A', 'T', 'G', 'C'] for base2 in ['A', 'T', 'G', 'C'] for base3 in ['A', 'T', 'G', 'C'] for base4 in ['A', 'T', 'G', 'C'] for base5 in ['A', 'T', 'G', 'C'] for base6 in ['A', 'T', 'G', 'C'] for base7 in ['A', 'T', 'G', 'C'] if base1+base2+base3+base4+base5+base6+base7 not in exclude_7]
nuc_8mer = [base1+base2+base3+base4+base5+base6+base7+base8 for base1 in ['A', 'T', 'G', 'C'] for base2 in ['A', 'T', 'G', 'C'] for base3 in ['A', 'T', 'G', 'C'] for base4 in ['A', 'T', 'G', 'C'] for base5 in ['A', 'T', 'G', 'C'] for base6 in ['A', 'T', 'G', 'C'] for base7 in ['A', 'T', 'G', 'C'] for base8 in ['A', 'T', 'G', 'C'] if base1+base2+base3+base4+base5+base6+base7+base8 not in exclude_8]
nuc_9mer = [base1+base2+base3+base4+base5+base6+base7+base8+base9 for base1 in ['A', 'T', 'G', 'C'] for base2 in ['A', 'T', 'G', 'C'] for base3 in ['A', 'T', 'G', 'C'] for base4 in ['A', 'T', 'G', 'C'] for base5 in ['A', 'T', 'G', 'C'] for base6 in ['A', 'T', 'G', 'C'] for base7 in ['A', 'T', 'G', 'C'] for base8 in ['A', 'T', 'G', 'C'] for base9 in ['A', 'T', 'G', 'C'] if base1+base2+base3+base4+base5+base6+base7+base8+base9 not in exclude_9]

feature_names = [bases, dinucleotides, trinucleotides, tetranucleotides, nuc_5mer, nuc_6mer, nuc_7mer, nuc_8mer, nuc_9mer]
feature_seq = [[part*5 for part in bases], [part*3 for part in dinucleotides], [part*2 for part in trinucleotides], [part*2 for part in tetranucleotides], [part*2 for part in nuc_5mer], [part*2 for part in nuc_6mer], [part*2 for part in nuc_7mer], [part*2 for part in nuc_8mer], [part*2 for part in nuc_9mer]]

In [None]:
# number of potential different N-mers per value of N
[len(feature) for feature in feature_names]

#### Reduce all possible N-mers to nonredundant N-mers <a name="DB_STR_define_nonredundant"></a>

[Return to Table of Contents](#TOC)
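
Two repeat units are redundant when one is a rotation of the other, since `AT` repeated and `TA` repeated describe the same tandem repeat. The cell below keeps the first unit of each rotation class. A compact sketch of the same idea follows; the helper names `rotations` and `nonredundant` are assumptions made for illustration.

```python
from itertools import product

def rotations(unit):
    # every cyclic shift of the unit, including the unit itself
    return {unit[i:] + unit[:i] for i in range(len(unit))}

def nonredundant(units):
    # keep a unit only if no rotation of it has been seen earlier in the list
    seen = set()
    kept = []
    for unit in units:
        if unit not in seen:
            kept.append(unit)
            seen |= rotations(unit)
    return kept

bases = ['A', 'T', 'G', 'C']
dinucleotides = [b1 + b2 for b1 in bases for b2 in bases if b1 != b2]
trinucleotides = [''.join(p) for p in product(bases, repeat=3) if len(set(p)) > 1]

dinucleotides_nonredundant = nonredundant(dinucleotides)
trinucleotides_nonredundant = nonredundant(trinucleotides)
```

Using a set for the rotations already seen avoids rescanning list slices, which matters more as N grows.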

In [None]:
exclude_pos = dict()
exclude_pos[1] = [seq[1]+seq[0] for seq in dinucleotides]
dinucleotides_nonredundant = [dinucleotides[pos] for pos in range(len(dinucleotides)) if dinucleotides[pos] not in exclude_pos[1][:pos] ]

exclude_pos = dict()
exclude_pos[1] = [seq[2]+seq[0]+seq[1] for seq in trinucleotides]
exclude_pos[2] = [seq[1]+seq[2]+seq[0] for seq in trinucleotides]
trinucleotides_nonredundant = [trinucleotides[pos] for pos in range(len(trinucleotides)) if (trinucleotides[pos] not in exclude_pos[1][:pos]) & (trinucleotides[pos] not in exclude_pos[2][:pos])]

exclude_pos = dict()
exclude_pos[1] = [seq[3]+seq[0]+seq[1]+seq[2] for seq in tetranucleotides]
exclude_pos[2] = [seq[2]+seq[3]+seq[0]+seq[1] for seq in tetranucleotides]
exclude_pos[3] = [seq[1]+seq[2]+seq[3]+seq[0] for seq in tetranucleotides]
tetranucleotides_nonredundant = [tetranucleotides[pos] for pos in range(len(tetranucleotides)) if (tetranucleotides[pos] not in exclude_pos[1][:pos]) & (tetranucleotides[pos] not in exclude_pos[2][:pos]) & (tetranucleotides[pos] not in exclude_pos[3][:pos])]

exclude_pos = dict()
exclude_pos[1] = [seq[4]+seq[0]+seq[1]+seq[2]+seq[3] for seq in nuc_5mer]
exclude_pos[2] = [seq[3]+seq[4]+seq[0]+seq[1]+seq[2] for seq in nuc_5mer]
exclude_pos[3] = [seq[2]+seq[3]+seq[4]+seq[0]+seq[1] for seq in nuc_5mer]
exclude_pos[4] = [seq[1]+seq[2]+seq[3]+seq[4]+seq[0] for seq in nuc_5mer]
nuc_5mer_nonredundant = [nuc_5mer[pos] for pos in range(len(nuc_5mer)) if (nuc_5mer[pos] not in exclude_pos[1][:pos]) & (nuc_5mer[pos] not in exclude_pos[2][:pos]) & (nuc_5mer[pos] not in exclude_pos[3][:pos]) & (nuc_5mer[pos] not in exclude_pos[4][:pos])]

exclude_pos = dict()
exclude_pos[1] = [seq[5]+seq[0]+seq[1]+seq[2]+seq[3]+seq[4] for seq in nuc_6mer]
exclude_pos[2] = [seq[4]+seq[5]+seq[0]+seq[1]+seq[2]+seq[3] for seq in nuc_6mer]
exclude_pos[3] = [seq[3]+seq[4]+seq[5]+seq[0]+seq[1]+seq[2] for seq in nuc_6mer]
exclude_pos[4] = [seq[2]+seq[3]+seq[4]+seq[5]+seq[0]+seq[1] for seq in nuc_6mer]
exclude_pos[5] = [seq[1]+seq[2]+seq[3]+seq[4]+seq[5]+seq[0] for seq in nuc_6mer]
nuc_6mer_nonredundant = [nuc_6mer[pos] for pos in range(len(nuc_6mer)) if (nuc_6mer[pos] not in exclude_pos[1][:pos]) & (nuc_6mer[pos] not in exclude_pos[2][:pos]) & (nuc_6mer[pos] not in exclude_pos[3][:pos]) & (nuc_6mer[pos] not in exclude_pos[4][:pos]) & (nuc_6mer[pos] not in exclude_pos[5][:pos])]

#### Reduce search space for high-N-mers by first searching for any N-mers that repeat at least once <a name="DB_STR_define_reduce"></a>

[Return to Table of Contents](#TOC)
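
The prefilter below splits each chromosome into consecutive N-mers in every reading frame and keeps those that are immediately followed by an identical N-mer. The sketch below applies the same idea to a toy sequence. The function `adjacent_repeats` is a hypothetical name, and the reported start positions include the frame offset, which is a choice made here for illustration.

```python
import re

def adjacent_repeats(seq, n):
    # find N-mers that are directly followed by an identical N-mer
    hits = []
    for offset in range(n):
        chunks = re.findall('.' * n, seq[offset:])
        for i in range(len(chunks) - 1):
            if chunks[i] == chunks[i + 1] and chunks[i] != 'N' * n:
                # position in the full sequence: frame offset plus chunk index * n
                hits.append((offset + i * n, chunks[i]))
    return sorted(hits)

toy = 'TT' + 'ACGTACG' * 2 + 'TTT'
hits = adjacent_repeats(toy, 7)
```

Only units that pass this prefilter need to be searched for explicitly, which keeps the number of 7-mer to 9-mer patterns manageable.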

In [None]:
rep_7mers = dict()
for chrom in range(chr_range,23):
    rep_7mers[chrom] = dict()
    for pos in range(7):
        rep_7mers[chrom][pos] = pd.DataFrame(re.findall('.'*7, reference_genome_masked[chrom][pos:]), columns = ['7mer'])
        rep_7mers[chrom][pos]['start'] = rep_7mers[chrom][pos].index*7 + pos  # add the frame offset so 'start' is a chromosome coordinate
        rep_7mers[chrom][pos]['shift'] = rep_7mers[chrom][pos]['7mer'].shift(-1)
        rep_7mers[chrom][pos] = rep_7mers[chrom][pos].loc[(rep_7mers[chrom][pos]['7mer'] == rep_7mers[chrom][pos]['shift']) & (rep_7mers[chrom][pos]['7mer'] != 'N'*7)]
    rep_7mers[chrom] = pd.concat(rep_7mers[chrom])
    print('\r' + '7-mer chr' + str(chrom), end='        ')

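# note: a 7-mer missing from any chromosome becomes NaN under += and is removed by dropna(),
# so only 7-mers with a tandem repeat on every processed chromosome are kept
# (the same applies to the 8-mer and 9-mer blocks below)
rep_7mer_counts_all = pd.Series(0, index = nuc_7mer)
for chrom in range(chr_range,23):
    rep_7mer_counts_all += pd.Series(rep_7mers[chrom]['7mer'].value_counts(), index = nuc_7mer)
rep_7mer_counts_all = rep_7mer_counts_all.dropna()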

exclude_pos = dict()
exclude_pos[1] = [seq[6]+seq[0]+seq[1]+seq[2]+seq[3]+seq[4]+seq[5] for seq in rep_7mer_counts_all.index]
exclude_pos[2] = [seq[5]+seq[6]+seq[0]+seq[1]+seq[2]+seq[3]+seq[4] for seq in rep_7mer_counts_all.index]
exclude_pos[3] = [seq[4]+seq[5]+seq[6]+seq[0]+seq[1]+seq[2]+seq[3] for seq in rep_7mer_counts_all.index]
exclude_pos[4] = [seq[3]+seq[4]+seq[5]+seq[6]+seq[0]+seq[1]+seq[2] for seq in rep_7mer_counts_all.index]
exclude_pos[5] = [seq[2]+seq[3]+seq[4]+seq[5]+seq[6]+seq[0]+seq[1] for seq in rep_7mer_counts_all.index]
exclude_pos[6] = [seq[1]+seq[2]+seq[3]+seq[4]+seq[5]+seq[6]+seq[0] for seq in rep_7mer_counts_all.index]
nuc_7mer_nonredundant = [rep_7mer_counts_all.index[pos] for pos in range(len(rep_7mer_counts_all.index)) if (rep_7mer_counts_all.index[pos] not in exclude_pos[1][:pos]) & (rep_7mer_counts_all.index[pos] not in exclude_pos[2][:pos]) & (rep_7mer_counts_all.index[pos] not in exclude_pos[3][:pos]) & (rep_7mer_counts_all.index[pos] not in exclude_pos[4][:pos]) & (rep_7mer_counts_all.index[pos] not in exclude_pos[5][:pos]) & (rep_7mer_counts_all.index[pos] not in exclude_pos[6][:pos])]

rep_8mers = dict()
for chrom in range(chr_range,23):
    rep_8mers[chrom] = dict()
    for pos in range(8):
        rep_8mers[chrom][pos] = pd.DataFrame(re.findall('.'*8, reference_genome_masked[chrom][pos:]), columns = ['8mer'])
        rep_8mers[chrom][pos]['start'] = rep_8mers[chrom][pos].index*8 + pos  # add the frame offset so 'start' is a chromosome coordinate
        rep_8mers[chrom][pos]['shift'] = rep_8mers[chrom][pos]['8mer'].shift(-1)
        rep_8mers[chrom][pos] = rep_8mers[chrom][pos].loc[(rep_8mers[chrom][pos]['8mer'] == rep_8mers[chrom][pos]['shift']) & (rep_8mers[chrom][pos]['8mer'] != 'N'*8)]
    rep_8mers[chrom] = pd.concat(rep_8mers[chrom])
    print('\r' + '8-mer chr' + str(chrom), end='        ')

rep_8mer_counts_all = pd.Series(0, index = nuc_8mer)
for chrom in range(chr_range,23):
    rep_8mer_counts_all += pd.Series(rep_8mers[chrom]['8mer'].value_counts(), index = nuc_8mer)
rep_8mer_counts_all = rep_8mer_counts_all.dropna()

exclude_pos = dict()
exclude_pos[1] = [seq[7]+seq[0]+seq[1]+seq[2]+seq[3]+seq[4]+seq[5]+seq[6] for seq in rep_8mer_counts_all.index]
exclude_pos[2] = [seq[6]+seq[7]+seq[0]+seq[1]+seq[2]+seq[3]+seq[4]+seq[5] for seq in rep_8mer_counts_all.index]
exclude_pos[3] = [seq[5]+seq[6]+seq[7]+seq[0]+seq[1]+seq[2]+seq[3]+seq[4] for seq in rep_8mer_counts_all.index]
exclude_pos[4] = [seq[4]+seq[5]+seq[6]+seq[7]+seq[0]+seq[1]+seq[2]+seq[3] for seq in rep_8mer_counts_all.index]
exclude_pos[5] = [seq[3]+seq[4]+seq[5]+seq[6]+seq[7]+seq[0]+seq[1]+seq[2] for seq in rep_8mer_counts_all.index]
exclude_pos[6] = [seq[2]+seq[3]+seq[4]+seq[5]+seq[6]+seq[7]+seq[0]+seq[1] for seq in rep_8mer_counts_all.index]
exclude_pos[7] = [seq[1]+seq[2]+seq[3]+seq[4]+seq[5]+seq[6]+seq[7]+seq[0] for seq in rep_8mer_counts_all.index]
nuc_8mer_nonredundant = [rep_8mer_counts_all.index[pos] for pos in range(len(rep_8mer_counts_all.index)) if (rep_8mer_counts_all.index[pos] not in exclude_pos[1][:pos]) & (rep_8mer_counts_all.index[pos] not in exclude_pos[2][:pos]) & (rep_8mer_counts_all.index[pos] not in exclude_pos[3][:pos]) & (rep_8mer_counts_all.index[pos] not in exclude_pos[4][:pos]) & (rep_8mer_counts_all.index[pos] not in exclude_pos[5][:pos]) & (rep_8mer_counts_all.index[pos] not in exclude_pos[6][:pos]) & (rep_8mer_counts_all.index[pos] not in exclude_pos[7][:pos])]

rep_9mers = dict()
for chrom in range(chr_range,23):
    rep_9mers[chrom] = dict()
    for pos in range(9):
        rep_9mers[chrom][pos] = pd.DataFrame(re.findall('.'*9, reference_genome_masked[chrom][pos:]), columns = ['9mer'])
        rep_9mers[chrom][pos]['start'] = rep_9mers[chrom][pos].index*9 + pos  # add the frame offset so 'start' is a chromosome coordinate
        rep_9mers[chrom][pos]['shift'] = rep_9mers[chrom][pos]['9mer'].shift(-1)
        rep_9mers[chrom][pos] = rep_9mers[chrom][pos].loc[(rep_9mers[chrom][pos]['9mer'] == rep_9mers[chrom][pos]['shift']) & (rep_9mers[chrom][pos]['9mer'] != 'N'*9)]
    rep_9mers[chrom] = pd.concat(rep_9mers[chrom])
    print('\r' + '9-mer chr' + str(chrom), end='        ')

rep_9mer_counts_all = pd.Series(0, index = nuc_9mer)
for chrom in range(chr_range,23):
    rep_9mer_counts_all += pd.Series(rep_9mers[chrom]['9mer'].value_counts(), index = nuc_9mer)
rep_9mer_counts_all = rep_9mer_counts_all.dropna()

exclude_pos = dict()
exclude_pos[1] = [seq[8]+seq[0]+seq[1]+seq[2]+seq[3]+seq[4]+seq[5]+seq[6]+seq[7] for seq in rep_9mer_counts_all.index]
exclude_pos[2] = [seq[7]+seq[8]+seq[0]+seq[1]+seq[2]+seq[3]+seq[4]+seq[5]+seq[6] for seq in rep_9mer_counts_all.index]
exclude_pos[3] = [seq[6]+seq[7]+seq[8]+seq[0]+seq[1]+seq[2]+seq[3]+seq[4]+seq[5] for seq in rep_9mer_counts_all.index]
exclude_pos[4] = [seq[5]+seq[6]+seq[7]+seq[8]+seq[0]+seq[1]+seq[2]+seq[3]+seq[4] for seq in rep_9mer_counts_all.index]
exclude_pos[5] = [seq[4]+seq[5]+seq[6]+seq[7]+seq[8]+seq[0]+seq[1]+seq[2]+seq[3] for seq in rep_9mer_counts_all.index]
exclude_pos[6] = [seq[3]+seq[4]+seq[5]+seq[6]+seq[7]+seq[8]+seq[0]+seq[1]+seq[2] for seq in rep_9mer_counts_all.index]
exclude_pos[7] = [seq[2]+seq[3]+seq[4]+seq[5]+seq[6]+seq[7]+seq[8]+seq[0]+seq[1] for seq in rep_9mer_counts_all.index]
exclude_pos[8] = [seq[1]+seq[2]+seq[3]+seq[4]+seq[5]+seq[6]+seq[7]+seq[8]+seq[0] for seq in rep_9mer_counts_all.index]
nuc_9mer_nonredundant = [rep_9mer_counts_all.index[pos] for pos in range(len(rep_9mer_counts_all.index)) if (rep_9mer_counts_all.index[pos] not in exclude_pos[1][:pos]) & (rep_9mer_counts_all.index[pos] not in exclude_pos[2][:pos]) & (rep_9mer_counts_all.index[pos] not in exclude_pos[3][:pos]) & (rep_9mer_counts_all.index[pos] not in exclude_pos[4][:pos]) & (rep_9mer_counts_all.index[pos] not in exclude_pos[5][:pos]) & (rep_9mer_counts_all.index[pos] not in exclude_pos[6][:pos]) & (rep_9mer_counts_all.index[pos] not in exclude_pos[7][:pos]) & (rep_9mer_counts_all.index[pos] not in exclude_pos[8][:pos])]

In [None]:
feature_names = [bases, dinucleotides_nonredundant, trinucleotides_nonredundant, tetranucleotides_nonredundant, nuc_5mer_nonredundant, nuc_6mer_nonredundant, nuc_7mer_nonredundant, nuc_8mer_nonredundant, nuc_9mer_nonredundant]

In [None]:
# Save temporary file: list of STR units to search
pd.Series(feature_names).to_pickle('./custom_db/temp/STR_units_in_genome_chr'+str(chr_range)+'-22.pickle')

In [None]:
# Load temporary file: list of STR units to search
feature_names = list(pd.read_pickle('./custom_db/temp/STR_units_in_genome_chr'+str(chr_range)+'-22.pickle'))

In [None]:
# Number of actual different N-mers per value of N to search
[len(feature) for feature in feature_names]

### Search, including overlaps <a name="DB_STR_search"></a>

[Return to Table of Contents](#TOC)
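
The search cells below pass `overlapped = True` to `re.finditer`. That keyword belongs to the third-party `regex` module, not the standard-library `re`, so the notebook presumably imports `regex` under the name `re`. For readers without that module, the sketch below shows an equivalent overlapping search using a zero-width lookahead with the standard library. `overlapping_starts` is a hypothetical helper.

```python
import re

def overlapping_starts(pattern, seq):
    # a lookahead matches without consuming characters,
    # so consecutive matches are allowed to overlap
    lookahead = '(?=' + re.escape(pattern) + ')'
    return [match.start() for match in re.finditer(lookahead, seq)]

starts = overlapping_starts('AAAAA', 'TAAAAAAAT')
```

A run of seven `A` bases contains three overlapping windows of five, which a non-overlapping search would report as a single hit.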

In [None]:
# Define minimum STR length for each N-mer size:
# 5 units for N=1, 3 units for N=2, 2 units for N>=3
feature_seq = [[part*5 for part in feature_names[0]], [part*3 for part in feature_names[1]], [part*2 for part in feature_names[2]], [part*2 for part in feature_names[3]], [part*2 for part in feature_names[4]], [part*2 for part in feature_names[5]], [part*2 for part in feature_names[6]], [part*2 for part in feature_names[7]], [part*2 for part in feature_names[8]]]

In [None]:
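# overlapped = True is supported by the third-party 'regex' module, not the standard-library 're';
# this cell assumes 're' refers to 'regex' (for example via 'import regex as re')
found_STRs = dict()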

for seq, name in zip(feature_seq[0], feature_names[0]):
    found_STRs[name] = dict()
    for chrom in range(chr_range,23):
        found_STRs[name][chrom] = pd.DataFrame()
        current_iter = re.finditer(seq, reference_genome_masked[chrom], overlapped = True)
        current_indices = [match.start(0) for match in current_iter]
        found_STRs[name][chrom]['start'] = current_indices
        found_STRs[name][chrom]['end'] = found_STRs[name][chrom]['start'] +len(seq)
        found_STRs[name][chrom]['chrom'] = chrom
        print('\r' + name + ' chr' + str(chrom), end='        ')

for seq, name in zip(feature_seq[1], feature_names[1]):
    found_STRs[name] = dict()
    for chrom in range(chr_range,23):
        found_STRs[name][chrom] = pd.DataFrame()
        current_iter = re.finditer(seq, reference_genome_masked[chrom], overlapped = True)
        current_indices = [match.start(0) for match in current_iter]
        found_STRs[name][chrom]['start'] = current_indices
        found_STRs[name][chrom]['end'] = found_STRs[name][chrom]['start'] +len(seq)
        found_STRs[name][chrom]['chrom'] = chrom
        print('\r' + name + ' chr' + str(chrom), end='        ')

counter = 0
for seq, name in zip(feature_seq[2], feature_names[2]):
    found_STRs[name] = dict()
    counter +=1
    for chrom in range(chr_range,23):
        found_STRs[name][chrom] = pd.DataFrame()
        current_iter = re.finditer(seq, reference_genome_masked[chrom], overlapped = True)
        current_indices = [match.start(0) for match in current_iter]
        found_STRs[name][chrom]['start'] = current_indices
        found_STRs[name][chrom]['end'] = found_STRs[name][chrom]['start'] +len(seq)
        found_STRs[name][chrom]['chrom'] = chrom
        print('\r' + name + ' chr' + str(chrom) + ' ' + str(counter) + '/' + str(len(feature_names[2])), end='        ')

# Same overlapping search for the remaining feature sets (feature_seq[3] through feature_seq[8])
for feature_idx in range(3, 9):
    counter = 0
    for seq, name in zip(feature_seq[feature_idx], feature_names[feature_idx]):
        found_STRs[name] = dict()
        counter +=1
        for chrom in range(chr_range,23):
            found_STRs[name][chrom] = pd.DataFrame()
            current_iter = re.finditer(seq, reference_genome_masked[chrom], overlapped = True)
            current_indices = [match.start(0) for match in current_iter]
            found_STRs[name][chrom]['start'] = current_indices
            found_STRs[name][chrom]['end'] = found_STRs[name][chrom]['start'] +len(seq)
            found_STRs[name][chrom]['chrom'] = chrom
            print('\r' + name + ' chr' + str(chrom) + ' ' + str(counter) + '/' + str(len(feature_names[feature_idx])), end='        ')

In [None]:
# Remove empties
for name in found_STRs:
    found_STRs[name] = pd.concat(found_STRs[name])
found_STRs_non0 = dict()
for name in found_STRs:
    if len(found_STRs[name]) != 0:
        found_STRs_non0[name] = found_STRs[name]
found_STRs = found_STRs_non0

### Expand by full repeat unit <a name="DB_STR_expand_full"></a>
[Return to Table of Contents](#TOC)
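
The cells below grow each seed match by one whole repeat unit at a time, first to the right and then to the left, until the neighbouring unit no longer equals the motif. As a rough illustration of that idea on a single string (plain Python instead of the per-chromosome DataFrames; the function name `expand_full_units` and the toy sequence are assumptions made for this sketch, not part of the analysis):

```python
def expand_full_units(genome, motif, start, end):
    """Grow the half-open interval [start, end) by whole copies of motif."""
    k = len(motif)
    # Extend to the right while the next k bases equal the motif
    while genome[end:end + k] == motif:
        end += k
    # Extend to the left while the previous k bases equal the motif
    while start - k >= 0 and genome[start - k:start] == motif:
        start -= k
    return start, end


toy = "GGACACACACTT"
# Seed match of 'AC' at positions 4-6 grows to cover all four units
print(expand_full_units(toy, "AC", 4, 6))  # (2, 10)
```

The sketch stops at the sequence boundary explicitly; the notebook cells instead rely on slicing and on the masked genome containing Ns at chromosome ends.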

In [None]:
# Search right
found_STRs_expand_R = dict()
for name in found_STRs:
    found_STRs_expand_R[name] = dict()
    exp_len = 0
    found_STRs_expand_R[name][exp_len] = found_STRs[name].astype(int).copy()
    found_STRs_expand_R[name][exp_len]['right+1'] = [reference_genome_masked[chrom][pos:pos+len(name)] for chrom, pos in zip(found_STRs_expand_R[name][exp_len]['chrom'], found_STRs_expand_R[name][exp_len]['end'])]
    remaining_len = 1
    found_STRs_expand_R[name][exp_len+1] = found_STRs_expand_R[name][exp_len].loc[found_STRs_expand_R[name][exp_len]['right+1'] == name]
    found_STRs_expand_R[name][exp_len] = found_STRs_expand_R[name][exp_len].loc[found_STRs_expand_R[name][exp_len]['right+1'] != name]
    while remaining_len >0:
        exp_len += 1
        found_STRs_expand_R[name][exp_len]['end'] += len(name)
        found_STRs_expand_R[name][exp_len]['right+1'] = [reference_genome_masked[chrom][pos:pos+len(name)] for chrom, pos in zip(found_STRs_expand_R[name][exp_len]['chrom'], found_STRs_expand_R[name][exp_len]['end'])]
        found_STRs_expand_R[name][exp_len+1] = found_STRs_expand_R[name][exp_len].loc[found_STRs_expand_R[name][exp_len]['right+1'] == name]
        found_STRs_expand_R[name][exp_len] = found_STRs_expand_R[name][exp_len].loc[found_STRs_expand_R[name][exp_len]['right+1'] != name]
        remaining_len = len(found_STRs_expand_R[name][exp_len])
        print('\r' + name + ' ' + str(remaining_len), end='        ')

In [None]:
# Search left
found_STRs_expand_L = dict()
for name in found_STRs:
    found_STRs_expand_L[name] = dict()
    exp_len = 0
    found_STRs_expand_L[name][exp_len] = pd.concat(found_STRs_expand_R[name])
    found_STRs_expand_L[name][exp_len]['left-1'] = [reference_genome_masked[chrom][pos-len(name):pos] for chrom, pos in zip(found_STRs_expand_L[name][exp_len]['chrom'], found_STRs_expand_L[name][exp_len]['start'])]
    remaining_len = 1
    found_STRs_expand_L[name][exp_len+1] = found_STRs_expand_L[name][exp_len].loc[found_STRs_expand_L[name][exp_len]['left-1'] == name]
    found_STRs_expand_L[name][exp_len] = found_STRs_expand_L[name][exp_len].loc[found_STRs_expand_L[name][exp_len]['left-1'] != name]
    while remaining_len >0:
        exp_len += 1
        found_STRs_expand_L[name][exp_len]['start'] -= len(name)
        found_STRs_expand_L[name][exp_len]['left-1'] = [reference_genome_masked[chrom][pos-len(name):pos] for chrom, pos in zip(found_STRs_expand_L[name][exp_len]['chrom'], found_STRs_expand_L[name][exp_len]['start'])]
        found_STRs_expand_L[name][exp_len+1] = found_STRs_expand_L[name][exp_len].loc[found_STRs_expand_L[name][exp_len]['left-1'] == name]
        found_STRs_expand_L[name][exp_len] = found_STRs_expand_L[name][exp_len].loc[found_STRs_expand_L[name][exp_len]['left-1'] != name]
        remaining_len = len(found_STRs_expand_L[name][exp_len])
        print('\r' + name + ' ' + str(remaining_len), end='        ')

In [None]:
# Remove duplicates
for name in found_STRs:
    found_STRs_expand_L[name] = pd.concat(found_STRs_expand_L[name])
    found_STRs_expand_L[name] = found_STRs_expand_L[name].drop_duplicates(subset = ['chrom', 'start', 'end'])[['chrom', 'start', 'end']].sort_values(by = ['chrom', 'start', 'end']).reset_index(drop = True)

### Expand by partial repeat units to get longest perfect STRs <a name="DB_STR_expand_partial"></a>
[Return to Table of Contents](#TOC)
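
After whole-unit expansion, a perfect STR can still continue into a partial unit: the right flank may start with a prefix of the motif and the left flank may end with a suffix of it. A small sketch of that step on a single string, assuming half-open coordinates as in the cells below (`expand_partial_units` is a hypothetical helper named for this example only):

```python
def expand_partial_units(genome, motif, start, end):
    """Extend [start, end) by the longest partial motif copy on each side."""
    k = len(motif)
    partial_R = 0
    for n in range(1, k):
        # Right flank must match the first n bases of the motif
        if genome[end:end + n] == motif[:n]:
            partial_R = n
        else:
            break
    partial_L = 0
    for n in range(1, k):
        # Left flank must match the last n bases of the motif
        if start - n >= 0 and genome[start - n:start] == motif[-n:]:
            partial_L = n
        else:
            break
    return start - partial_L, end + partial_R, partial_L, partial_R


# Two full 'CAG' units at [1, 7), flanked by 'G' on the left and 'CA' on the right
print(expand_partial_units("GCAGCAGCAT", "CAG", 1, 7))  # (0, 9, 1, 2)
```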

In [None]:
# Search right
found_STRs_expand_partial_R = dict()
for name in found_STRs:
    found_STRs_expand_partial_R[name] = dict()
    found_STRs_expand_partial_R[name][0] = found_STRs_expand_L[name].copy()
    found_STRs_expand_partial_R[name][0]['right+1'] = [reference_genome_masked[chrom][pos:pos+len(name)] for chrom, pos in zip(found_STRs_expand_partial_R[name][0]['chrom'], found_STRs_expand_partial_R[name][0]['end'])]
    found_STRs_expand_partial_R[name][0]['partial_len_R'] = 0
    for pos in range(1, len(name)):
        found_STRs_expand_partial_R[name][pos] = found_STRs_expand_partial_R[name][pos-1].loc[found_STRs_expand_partial_R[name][pos-1]['right+1'].str[:pos] == name[:pos]].copy()
        found_STRs_expand_partial_R[name][pos]['partial_len_R'] = pos
        found_STRs_expand_partial_R[name][pos-1] = found_STRs_expand_partial_R[name][pos-1].loc[found_STRs_expand_partial_R[name][pos-1]['right+1'].str[:pos] != name[:pos]].copy()
        found_STRs_expand_partial_R[name][pos-1]['partial_len_R'] = pos-1
    print('\r' + name, end='        ')

In [None]:
# Search left
found_STRs_expand_partial_L = dict()
for name in found_STRs:
    found_STRs_expand_partial_L[name] = dict()
    found_STRs_expand_partial_L[name][0] = pd.concat(found_STRs_expand_partial_R[name]).copy()
    found_STRs_expand_partial_L[name][0]['left-1'] = [reference_genome_masked[chrom][pos-len(name):pos] for chrom, pos in zip(found_STRs_expand_partial_L[name][0]['chrom'], found_STRs_expand_partial_L[name][0]['start'])]
    found_STRs_expand_partial_L[name][0]['partial_len_L'] = 0
    for pos in range(1, len(name)):
        found_STRs_expand_partial_L[name][pos] = found_STRs_expand_partial_L[name][pos-1].loc[found_STRs_expand_partial_L[name][pos-1]['left-1'].str[-pos:] == name[-pos:]].copy()
        found_STRs_expand_partial_L[name][pos]['partial_len_L'] = pos
        found_STRs_expand_partial_L[name][pos-1] = found_STRs_expand_partial_L[name][pos-1].loc[found_STRs_expand_partial_L[name][pos-1]['left-1'].str[-pos:] != name[-pos:]].copy()
        found_STRs_expand_partial_L[name][pos-1]['partial_len_L'] = pos-1
    print('\r' + name, end='        ')

In [None]:
# Remove duplicates
for name in found_STRs_expand_partial_L:
    found_STRs_expand_partial_L[name] = pd.concat(found_STRs_expand_partial_L[name])
    found_STRs_expand_partial_L[name] = found_STRs_expand_partial_L[name].sort_values(by = ['chrom', 'start', 'end']).drop_duplicates(subset = ['chrom', 'start', 'end']).reset_index(drop = True)[['chrom', 'start', 'end', 'partial_len_L', 'partial_len_R']]

In [None]:
# Generate annotations
for name in found_STRs_expand_partial_L:
    found_STRs_expand_partial_L[name]['full_start'] = found_STRs_expand_partial_L[name]['start']
    found_STRs_expand_partial_L[name]['full_end'] = found_STRs_expand_partial_L[name]['end']
    found_STRs_expand_partial_L[name]['start'] = found_STRs_expand_partial_L[name]['start'] - found_STRs_expand_partial_L[name]['partial_len_L']
    found_STRs_expand_partial_L[name]['end'] = found_STRs_expand_partial_L[name]['end'] + found_STRs_expand_partial_L[name]['partial_len_R']
    found_STRs_expand_partial_L[name]['seq'] = [reference_genome_masked[chrom][start:end] for chrom, start, end in zip(found_STRs_expand_partial_L[name]['chrom'], found_STRs_expand_partial_L[name]['start'], found_STRs_expand_partial_L[name]['end'])]
    found_STRs_expand_partial_L[name]['length'] = found_STRs_expand_partial_L[name]['end'] - found_STRs_expand_partial_L[name]['start']
    found_STRs_expand_partial_L[name]['n_units'] = found_STRs_expand_partial_L[name]['length'] / len(name)

### Find imperfect STRs <a name="DB_STR_imperfect"></a>

#### Overlap coordinates, allowing for up to one repeat unit of imperfection <a name="DB_STR_imperfect_overlap"></a>

[Return to Table of Contents](#TOC)
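
The merging step pads every perfect STR window on both sides, joins windows that now touch or overlap, and then removes the padding from the merged result. The sketch below shows the same idea with a standard sort-and-sweep over tuples; it is not the shift-based DataFrame implementation used in the next cell, and `merge_with_padding` is a name introduced here for illustration.

```python
def merge_with_padding(intervals, pad):
    """Merge half-open (start, end) intervals after padding each side by pad."""
    padded = sorted((s - pad, e + pad) for s, e in intervals)
    merged = []
    for s, e in padded:
        if merged and s <= merged[-1][1]:
            # Padded window reaches the previous one: extend it
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    # Remove the padding again from the outer edges
    return [(s + pad, e - pad) for s, e in merged]


windows = [(10, 19), (22, 31), (50, 59)]
print(merge_with_padding(windows, 3))  # [(10, 31), (50, 59)]
print(merge_with_padding(windows, 1))  # [(10, 19), (22, 31), (50, 59)]
```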

In [None]:
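# Merge overlapping windows per chromosome; each window is padded by interruption_len on both sides first, so that windows separated by a short interruption are merged too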
def interval_overlap_per_chrom_interruption(df_in, interruption_len):
    df_out = dict()
    for chrom in range(chr_range,23):
        df_current = df_in.loc[df_in['chrom'] == chrom]
        if len(df_current) > 0:
            range_df = pd.DataFrame()            
            range_df['start'] = sorted(list(df_current['start'] - interruption_len))
            range_df['end'] = sorted(list(df_current['end'] + interruption_len))
            
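            # Pair each window's start with the previous window's end
            range_df_shift = pd.DataFrame()
            range_df_shift['start'] = range_df['start'][1:].reset_index(drop = True)
            range_df_shift['end'] = range_df['end'][:-1].reset_index(drop = True)
            # 'overlap' is True where a gap separates consecutive windows, i.e. at the boundaries between merged windows
            range_df_shift['overlap'] = range_df_shift['start'] > range_df_shift['end']
            range_df_shift = range_df_shift.loc[range_df_shift['overlap'] == True]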

            range_df_merged = pd.DataFrame()
            range_df_merged['start'] = [range_df['start'][0]] +list(range_df_shift['start'])
            range_df_merged['end'] = list(range_df_shift['end']) + [range_df['end'][len(range_df.index)-1]]

            range_df = range_df_merged
            range_df['chrom'] = chrom

            range_df['start'] = range_df['start'] + interruption_len
            range_df['end'] = range_df['end'] - interruption_len

            df_out[chrom] = range_df
    df_out = pd.concat(df_out).reset_index(drop = True)
    return df_out

In [None]:
# Find interruptions
found_STRs_interruptions = dict()
for name in found_STRs_expand_partial_L:
    found_STRs_interruptions[name] = interval_overlap_per_chrom_interruption(found_STRs_expand_partial_L[name], len(name))
    print('\r' + name, end='        ')

# Add annotations
for name in found_STRs_interruptions:
    found_STRs_interruptions[name]['start'] = found_STRs_interruptions[name]['start'].astype(int)
    found_STRs_interruptions[name]['end'] = found_STRs_interruptions[name]['end'].astype(int)
    found_STRs_interruptions[name]['length'] = found_STRs_interruptions[name]['end'] - found_STRs_interruptions[name]['start']
    found_STRs_interruptions[name]['length'] = found_STRs_interruptions[name]['length'].astype(int)
    found_STRs_interruptions[name]['repeat'] = name
    found_STRs_interruptions[name]['Sequence'] = [reference_genome_masked[chrom][start:end] for chrom, start, end in zip(found_STRs_interruptions[name]['chrom'], found_STRs_interruptions[name]['start'], found_STRs_interruptions[name]['end'])]
    found_STRs_interruptions[name]['n_units'] = found_STRs_interruptions[name]['length'] / len(name)

#### Find location of interruptions within each STR <a name="DB_STR_imperfect_location"></a>

[Return to Table of Contents](#TOC)
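
The cell below takes the first and last repeat unit of each merged sequence as reading frames, finds every (overlapping) occurrence of each frame, and keeps the occurrences that are out of phase with the respective end. The sketch below reproduces this for one sequence. It uses a lookahead pattern with the standard library `re`, because the `overlapped` keyword used in the notebook is not accepted by the standard library (it is a feature of the third-party `regex` package); `out_of_frame_positions` is a hypothetical name.

```python
import re


def out_of_frame_positions(seq, unit_len):
    """Return unit start positions that are out of frame from the L and R ends."""
    frame_L = seq[:unit_len]
    frame_R = seq[-unit_len:]
    # Zero-width lookahead yields overlapping match positions
    pos_L = [m.start() for m in re.finditer(f"(?={re.escape(frame_L)})", seq)]
    pos_R = [m.start() for m in re.finditer(f"(?={re.escape(frame_R)})", seq)]
    off_L = [p for p in pos_L if p % unit_len != 0]
    off_R = [p for p in pos_R if p % unit_len != len(seq) % unit_len]
    return off_L, off_R


print(out_of_frame_positions("ACACACAC", 2))    # ([], [])
print(out_of_frame_positions("ACACGGACAC", 2))  # ([], [])
print(out_of_frame_positions("ACACTACAC", 2))   # ([5, 7], [0, 2])
```

A substitution of a whole unit keeps both ends in phase, whereas the single inserted `T` in the last example shifts the frame, so units on either side of it are out of phase with the opposite end.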

In [None]:
for name in found_STRs_interruptions:
    found_STRs_interruptions[name]['repeat_frame_L'] = [seq[:len(name)] for seq in found_STRs_interruptions[name]['Sequence']]
    found_STRs_interruptions[name]['repeat_frame_R'] = [seq[-len(name):] for seq in found_STRs_interruptions[name]['Sequence']]

for name in found_STRs_interruptions:
    found_STRs_interruptions[name]['rep_pos_L'] = [[match.start(0) for match in re.finditer(frame, seq, overlapped = True)] for frame, seq in zip(found_STRs_interruptions[name]['repeat_frame_L'], found_STRs_interruptions[name]['Sequence'])]
    found_STRs_interruptions[name]['rep_pos_R'] = [[match.start(0) for match in re.finditer(frame, seq, overlapped = True)] for frame, seq in zip(found_STRs_interruptions[name]['repeat_frame_R'], found_STRs_interruptions[name]['Sequence'])]
    print('\r' + name, end='        ')

for name in found_STRs_interruptions:
    found_STRs_interruptions[name]['n_full_units_L'] = [len(matches) for matches in found_STRs_interruptions[name]['rep_pos_L']]
    found_STRs_interruptions[name]['n_full_units_R'] = [len(matches) for matches in found_STRs_interruptions[name]['rep_pos_R']]

for name in found_STRs_interruptions:
    found_STRs_interruptions[name]['rep_pos_inline_L'] = [[pos for pos in matches if pos%len(name) != 0] for matches in found_STRs_interruptions[name]['rep_pos_L']]
    found_STRs_interruptions[name]['rep_pos_inline_R'] = [[pos for pos in matches if pos%len(name) != length%len(name)] for matches, length in zip(found_STRs_interruptions[name]['rep_pos_R'], found_STRs_interruptions[name]['length'])]

#### Separate perfect vs imperfect STRs <a name="DB_STR_imperfect_separate"></a>

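Only STRs with at least 3 full repeat units (counted from the L end) are kept. "Repeat length" below means the number of full repeat units multiplied by the unit length.

- perfect STRs: no out-of-frame unit positions from the L or R end (all unit positions divisible by the unit length), and total length is within 1 unit length of the repeat length
- in-frame mutations: no out-of-frame unit positions from the L or R end, and total length is not within 1 unit length of the repeat length
- short indels: out-of-frame unit positions from the L or R end, and total length is within 1 unit length of the repeat length counted from either end
- complex: out-of-frame unit positions from the L or R end, and total length is not within 1 unit length of the repeat length counted from either end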


[Return to Table of Contents](#TOC)
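
The four categories above can be written as a small decision function. This is a sketch of the rules as listed, applied to one STR at a time rather than to whole DataFrames; the function name `classify_str` and its argument names are assumptions for illustration.

```python
def classify_str(length, unit_len, n_full_L, n_full_R, off_L, off_R, min_units=3):
    """Classify one merged STR; returns None if it has too few full units."""
    if n_full_L < min_units:
        return None
    in_frame = not off_L and not off_R
    # Bases not accounted for by full units, counted from each end
    resid_L = length - n_full_L * unit_len
    resid_R = length - n_full_R * unit_len
    if in_frame:
        return 'perfect' if resid_L < unit_len else 'inframe'
    if resid_L < unit_len or resid_R < unit_len:
        return 'shortindel'
    return 'complex'


print(classify_str(8, 2, 4, 4, [], []))           # perfect    (ACACACAC)
print(classify_str(10, 2, 4, 4, [], []))          # inframe    (ACACGGACAC)
print(classify_str(9, 2, 4, 4, [5, 7], [0, 2]))   # shortindel (ACACTACAC)
```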

In [None]:
found_STRs_perfect = dict()
found_STRs_inframe = dict()
found_STRs_shortindel = dict()
found_STRs_complex = dict()
for name in found_STRs_interruptions:
    df = found_STRs_interruptions[name]
    unit_len = len(name)
    # No out-of-frame unit positions from either end
    in_frame = (df['rep_pos_inline_L'].str.len() == 0) & (df['rep_pos_inline_R'].str.len() == 0)
    # Bases not covered by full repeat units, counted from each end
    resid_L = df['length'] - (df['n_full_units_L'] * unit_len)
    resid_R = df['length'] - (df['n_full_units_R'] * unit_len)
    min_units = df['n_full_units_L'] >=3
    found_STRs_perfect[name] = df.loc[in_frame & (resid_L < unit_len) & min_units].copy()
    found_STRs_inframe[name] = df.loc[in_frame & (resid_L >= unit_len) & min_units].copy()
    found_STRs_shortindel[name] = df.loc[~in_frame & ((resid_L < unit_len) | (resid_R < unit_len)) & min_units].copy()
    found_STRs_complex[name] = df.loc[~in_frame & (resid_L >= unit_len) & (resid_R >= unit_len) & min_units].copy()

#### Save/load STR database <a name="DB_STR_imperfect_save"></a>
[Return to Table of Contents](#TOC)

In [None]:
# Combine
for name in found_STRs:
    found_STRs_perfect[name]['status'] = 'perfect'
    found_STRs_inframe[name]['status'] = 'inframe'
    found_STRs_shortindel[name]['status'] = 'shortindel'
    found_STRs_complex[name]['status'] = 'complex'

all_STRs = pd.concat([pd.concat(found_STRs_perfect), pd.concat(found_STRs_inframe), pd.concat(found_STRs_shortindel), pd.concat(found_STRs_complex)])
all_STRs = all_STRs.sort_values(by = ['chrom', 'start', 'end', 'repeat']).reset_index(drop = True)

In [None]:
# Assign strand to reverse-complementary motifs

# STR motifs with enough power to analyze
repeats_highpower = ['T', 'G', 'TG', 'TC', 'TGG', 'ATT', 'TTG', 'TTC', 'TCC', 'ATG', 'TGC', 'ATTT', 'A', 'C',  'AC', 'AG', 'ACC', 'AAT', 'AAC', 'AAG', 'AGG', 'ATC', 'AGC', 'AAAT', 'AT', 'GC']
# Symmetric and asymmetric STR motifs
repeats_highpower_asym = pd.Series(['T', 'G', 'TG', 'TC', 'TGG', 'ATT', 'TTG', 'TTC', 'TCC', 'ATG', 'TGC', 'ATTT'], index = ['A', 'C',  'AC', 'AG', 'ACC', 'AAT', 'AAC', 'AAG', 'AGG', 'ATC', 'AGC', 'AAAT'])
repeats_highpower_sym = ['AT', 'GC']

In [None]:
all_STRs_strand_F = all_STRs.loc[all_STRs['repeat'].isin(repeats_highpower_asym.index)].copy()
all_STRs_strand_F['Strand'] = '+'
all_STRs_strand_R = all_STRs.loc[all_STRs['repeat'].isin(repeats_highpower_asym)].copy()
all_STRs_strand_R['Strand'] = '-'
all_STRs_strand_na = all_STRs.loc[~(all_STRs['repeat'].isin(repeats_highpower_asym.index)) & ~(all_STRs['repeat'].isin(repeats_highpower_asym))].copy()

all_STRs = pd.concat([all_STRs_strand_F, all_STRs_strand_R, all_STRs_strand_na]).sort_values(by = ['chrom', 'start', 'end', 'repeat'])

In [None]:
# Save 
all_STRs.to_csv('./custom_db/temp/STRs_custom_imperfect_1-9_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', index = False)

In [None]:
# Load
all_STRs = pd.read_csv('./custom_db/temp/STRs_custom_imperfect_1-9_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip')

### Mask reference genome with STRs <a name="DB_STR_mask"></a>
[Return to Table of Contents](#TOC)
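
The masking step replaces every base inside an STR interval with `N`, so that later searches skip those regions. The helper `coordinates_to_Ns` called below is not shown in this section; the following is only a sketch of the general idea on a single string, with `mask_intervals` as a hypothetical stand-in that does not write any FASTA files.

```python
def mask_intervals(sequence, intervals):
    """Replace bases in each half-open (start, end) interval with 'N'."""
    chars = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence so the length never changes
        start, end = max(start, 0), min(end, len(chars))
        if start < end:
            chars[start:end] = ['N'] * (end - start)
    return ''.join(chars)


print(mask_intervals("ACGTACGTAC", [(2, 5), (8, 20)]))  # ACNNNCGTNN
```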

In [None]:
# Generate masked reference genome
coordinates_to_Ns(all_STRs, reference_genome_masked, 'transposons and STRs replaced with Ns', 'RMnosimple_STR')

In [None]:
# Load masked reference genome
reference_genome_masked_STR = dict()
for chrom in chr_list:
    with open('./hg38/chromosomes/masked/chr'+str(chrom)+'_mask_RMnosimple_STR.fasta') as fasta_file:
        for sequence in SimpleFastaParser(fasta_file):
            chr_seq = sequence
        reference_genome_masked_STR[chrom] = chr_seq[1]
        reference_genome_masked_STR[chrom] = reference_genome_masked_STR[chrom].upper()

## Inverted repeats <a name="DB_IR"></a>

#### Find all IR 5-mers <a name="DB_IR_5mer"></a>

Make sure the directory './custom_db/temp/' exists for output.

[Return to Table of Contents](#TOC)
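
The seed search pairs every 5-mer with any downstream 5-mer that equals its reverse complement, allowing a spacer of 0 to `max_spacer` bases between the two arms. A brute-force sketch on a single string follows; the notebook instead shifts a DataFrame column one row at a time, which scales better to whole chromosomes. `find_seed_irs` is a hypothetical name, and the `reverse_complement` defined here is a local stand-in for the notebook's own function of the same name.

```python
_COMPLEMENT = str.maketrans('ACGTN', 'TGCAN')


def reverse_complement(seq):
    """Local stand-in for the notebook's reverse_complement helper."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_seed_irs(genome, n=5, max_spacer=100):
    """Return (L_start, R_start, spacer) for every n-mer inverted repeat seed."""
    hits = []
    for l_start in range(len(genome) - n + 1):
        left = genome[l_start:l_start + n]
        if 'N' in left:
            continue  # skip masked sequence
        target = reverse_complement(left)
        first_r = l_start + n
        last_r = min(first_r + max_spacer, len(genome) - n)
        for r_start in range(first_r, last_r + 1):
            if genome[r_start:r_start + n] == target:
                hits.append((l_start, r_start, r_start - first_r))
    return hits


# 'AACGT' and its reverse complement 'ACGTT', separated by a 2 bp spacer
print(find_seed_irs("AACGTCCACGTT"))  # [(0, 7, 2)]
```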

In [None]:
N = 5   # N-mer size
max_spacer = 100    # Max spacer size

In [None]:
RC_match = dict()
for chrom in range(chr_range,23):
    # Find all 5-mers in chromosome
    Nmer_sets_overlap = pd.DataFrame(re.findall('.'*N, reference_genome_masked_STR[chrom], overlapped = True), columns = ['seq'])
    Nmer_sets_overlap = Nmer_sets_overlap.loc[Nmer_sets_overlap['seq'].str.count('N') ==0]
    Nmer_sets_overlap['L_start'] = Nmer_sets_overlap.index
    Nmer_sets_overlap.reset_index(drop = True, inplace = True)
    # Get reverse_complement of each 5-mer
    Nmer_sets_overlap['seq_RC'] = [reverse_complement(seq) for seq in Nmer_sets_overlap['seq']]
    Nmer_sets_overlap['R_start'] = Nmer_sets_overlap['L_start']
    # For each spacer length, check if FWD 5-mer = REVC 5-mer
    Nmer_sets_overlap['spacer'] = -N
    RC_match[chrom] = dict()
    for n_shifts in range(max_spacer+N+1):
        RC_match[chrom][n_shifts] = Nmer_sets_overlap.loc[Nmer_sets_overlap['seq'] == Nmer_sets_overlap['seq_RC']]
        Nmer_sets_overlap['seq_RC'] = Nmer_sets_overlap['seq_RC'].shift(-1)
        Nmer_sets_overlap['R_start'] = Nmer_sets_overlap['R_start'].shift(-1)
        Nmer_sets_overlap['spacer'] = Nmer_sets_overlap['spacer'] + 1
        print('\r' + 'shifting spacer ' +str(n_shifts) + ' chr' + str(chrom), end='        ')
    RC_match[chrom] = pd.concat(RC_match[chrom])
    RC_match[chrom]['chrom'] = chrom
RC_match = pd.concat(RC_match)
RC_match.reset_index(drop = True, inplace = True)

In [None]:
del Nmer_sets_overlap
IR_5mers_expand = dict()
IR_5mers_expand[N] = RC_match.loc[RC_match['spacer'] >= 0][['chrom', 'L_start', 'R_start']].copy()
IR_5mers_expand[N]['R_start'] = IR_5mers_expand[N]['R_start'].astype(int)
del RC_match
# save temporary file
IR_5mers_expand[N].to_csv('./custom_db/temp/IR_SL'+str(N)+'_SP'+str(max_spacer)+'_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', index = False)

#### Expand perfect IRs <a name="DB_IR_expand"></a>
[Return to Table of Contents](#TOC)
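
Each seed is then lengthened one base pair at a time: outwards while the base to the left of the left arm pairs with the base to the right of the right arm, and inwards while the spacer still has at least two bases and its outermost bases pair with each other. A sketch for a single IR with half-open arm coordinates (`expand_perfect_ir` and the `COMPLEMENT` table are assumptions made for this example):

```python
COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}


def expand_perfect_ir(genome, l_start, l_end, r_start, r_end):
    """Extend both arms of an inverted repeat outward, then inward."""
    # Outward: bases just outside the two arms must be complementary
    while l_start > 0 and r_end < len(genome):
        if COMPLEMENT.get(genome[r_end]) != genome[l_start - 1]:
            break  # mismatch or masked base (N has no complement here)
        l_start -= 1
        r_end += 1
    # Inward: consume the spacer two bases at a time while it pairs up
    while r_start - l_end > 1:
        if COMPLEMENT.get(genome[r_start - 1]) != genome[l_end]:
            break
        l_end += 1
        r_start -= 1
    return l_start, l_end, r_start, r_end


print(expand_perfect_ir("CGAACGTCCACGTTCA", 2, 7, 9, 14))  # (1, 7, 9, 15)
print(expand_perfect_ir("AACGTCATGACGTT", 0, 5, 9, 14))    # (0, 7, 7, 14)
```

In the second example the spacer `CATG` is itself palindromic, so the arms grow inward until the spacer length reaches 0.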

In [None]:
# Load temporary file
stem_len = 5    # Initial N-mer size from above section
max_spacer = 100  # Max spacer size from above section

IR_5mers_expand = dict()
IR_5mers_expand[stem_len] = pd.read_csv('./custom_db/temp/IR_SL'+str(stem_len)+'_SP'+str(max_spacer)+'_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip')

In [None]:
IR_5mers_expand[stem_len]['stem_len'] = stem_len
IR_5mers_expand[stem_len]['L_end'] = IR_5mers_expand[stem_len]['L_start'] + stem_len
IR_5mers_expand[stem_len]['R_end'] = IR_5mers_expand[stem_len]['R_start'] + stem_len
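# Recompute the spacer from coordinates and keep IRs within the maximum spacer
IR_5mers_expand[stem_len]['spacer'] = IR_5mers_expand[stem_len]['R_start'] - IR_5mers_expand[stem_len]['L_end']
IR_5mers_expand[stem_len] = IR_5mers_expand[stem_len].loc[IR_5mers_expand[stem_len]['spacer'] <= max_spacer]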

# Continue IR search outwards from initial matches
list_len = len(IR_5mers_expand[stem_len])
while list_len > 0:
    IR_5mers_expand[stem_len]['seq_L'] = [reference_genome_masked_STR[chrom][start-1:end] for chrom, start, end in zip(IR_5mers_expand[stem_len]['chrom'], IR_5mers_expand[stem_len]['L_start'], IR_5mers_expand[stem_len]['L_end'])]
    IR_5mers_expand[stem_len]['seq_R'] = [reference_genome_masked_STR[chrom][start:end+1] for chrom, start, end in zip(IR_5mers_expand[stem_len]['chrom'], IR_5mers_expand[stem_len]['R_start'], IR_5mers_expand[stem_len]['R_end'])]
    IR_5mers_expand[stem_len]['seq_R'] = [reverse_complement(seq) for seq in IR_5mers_expand[stem_len]['seq_R']]
    IR_5mers_expand[stem_len+1] = IR_5mers_expand[stem_len].loc[(IR_5mers_expand[stem_len]['seq_L'] == IR_5mers_expand[stem_len]['seq_R']) & (IR_5mers_expand[stem_len]['seq_L'].str.count('N') == 0) & (IR_5mers_expand[stem_len]['seq_R'].str.count('N') == 0)]
    IR_5mers_expand[stem_len] = IR_5mers_expand[stem_len].loc[(IR_5mers_expand[stem_len]['seq_L'] != IR_5mers_expand[stem_len]['seq_R']) | (IR_5mers_expand[stem_len]['seq_L'].str.count('N') != 0) | (IR_5mers_expand[stem_len]['seq_R'].str.count('N') != 0)]
    stem_len +=1
    list_len = len(IR_5mers_expand[stem_len])
    IR_5mers_expand[stem_len]['L_start'] = IR_5mers_expand[stem_len]['L_start'] - 1
    IR_5mers_expand[stem_len]['R_end'] = IR_5mers_expand[stem_len]['R_end'] + 1
    IR_5mers_expand[stem_len]['stem_len'] = stem_len
    print('\r' + 'expanding out stem_len ' +str(stem_len), end='        ')
IR_5mers_expand = pd.concat(IR_5mers_expand)
del IR_5mers_expand['seq_L']; del IR_5mers_expand['seq_R']

# Remove duplicates (per chromosome) and IRs that expand into masked regions
IR_5mers_expand = IR_5mers_expand.drop_duplicates(subset = ['chrom', 'L_start', 'L_end', 'R_start', 'R_end'], keep = 'first')
IR_5mers_expand['Sequence'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(IR_5mers_expand['chrom'], IR_5mers_expand['L_start'], IR_5mers_expand['R_end'])]
IR_5mers_expand = IR_5mers_expand.loc[IR_5mers_expand['Sequence'].str.count('N') == 0]

In [None]:
# Continue IR search inward, unless spacer length is already 0
IR_5mers_expand_in = dict()
in_len = 0
IR_5mers_expand_in[in_len] = IR_5mers_expand.copy()

list_len = len(IR_5mers_expand_in[in_len])
while list_len > 0:
    IR_5mers_expand_in[in_len]['seq_L'] = [reference_genome_masked_STR[chrom][start:end+1] for chrom, start, end in zip(IR_5mers_expand_in[in_len]['chrom'], IR_5mers_expand_in[in_len]['L_start'], IR_5mers_expand_in[in_len]['L_end'])]
    IR_5mers_expand_in[in_len]['seq_R'] = [reference_genome_masked_STR[chrom][start-1:end] for chrom, start, end in zip(IR_5mers_expand_in[in_len]['chrom'], IR_5mers_expand_in[in_len]['R_start'], IR_5mers_expand_in[in_len]['R_end'])]
    IR_5mers_expand_in[in_len]['seq_R'] = [reverse_complement(seq) for seq in IR_5mers_expand_in[in_len]['seq_R']]
    IR_5mers_expand_in[in_len+1] = IR_5mers_expand_in[in_len].loc[(IR_5mers_expand_in[in_len]['seq_L'] == IR_5mers_expand_in[in_len]['seq_R']) & (IR_5mers_expand_in[in_len]['seq_L'].str.count('N') == 0) & (IR_5mers_expand_in[in_len]['seq_R'].str.count('N') == 0) & (IR_5mers_expand_in[in_len]['spacer'] > 1)]
    IR_5mers_expand_in[in_len] = IR_5mers_expand_in[in_len].loc[(IR_5mers_expand_in[in_len]['seq_L'] != IR_5mers_expand_in[in_len]['seq_R']) | (IR_5mers_expand_in[in_len]['seq_L'].str.count('N') != 0) | (IR_5mers_expand_in[in_len]['seq_R'].str.count('N') != 0) | (IR_5mers_expand_in[in_len]['spacer'] <= 1)]
    in_len +=1
    list_len = len(IR_5mers_expand_in[in_len])
    IR_5mers_expand_in[in_len]['R_start'] = IR_5mers_expand_in[in_len]['R_start'] - 1
    IR_5mers_expand_in[in_len]['L_end'] = IR_5mers_expand_in[in_len]['L_end'] + 1
    IR_5mers_expand_in[in_len]['spacer'] = IR_5mers_expand_in[in_len]['spacer'] -2
    print('\r' + 'expanding in ' +str(in_len), end='        ')

IR_5mers_expand_in = pd.concat(IR_5mers_expand_in).reset_index(drop = True)
del IR_5mers_expand_in['seq_L']; del IR_5mers_expand_in['seq_R']

# Remove duplicates (per chromosome) and IRs that expand into masked regions
IR_5mers_expand_in = IR_5mers_expand_in.drop_duplicates(subset = ['chrom', 'L_start', 'L_end', 'R_start', 'R_end'], keep = 'first')
IR_5mers_expand_in['Sequence'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(IR_5mers_expand_in['chrom'], IR_5mers_expand_in['L_start'], IR_5mers_expand_in['R_end'])]
IR_5mers_expand_in = IR_5mers_expand_in.loc[IR_5mers_expand_in['Sequence'].str.count('N') == 0]

#### Expand imperfect IRs <a name="DB_IR_expand_imperfect"></a>
[Return to Table of Contents](#TOC)
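
For imperfect IRs, the base pair directly outside a perfect IR is known to mismatch, so the search skips it and tests the next pair out. If that pair matches, the arms grow by 2 and then continue one base pair at a time as before. A sketch of the outward case for a single IR (the inward case mirrors it with the spacer check); `expand_imperfect_outward` and `COMPLEMENT` are hypothetical names used only here.

```python
COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}


def expand_imperfect_outward(genome, l_start, r_end):
    """Skip one mismatched pair outside [l_start, r_end), then keep extending."""
    # genome[l_start - 1] and genome[r_end] are the mismatching pair
    if l_start - 2 < 0 or r_end + 1 >= len(genome):
        return l_start, r_end
    if COMPLEMENT.get(genome[r_end + 1]) != genome[l_start - 2]:
        return l_start, r_end  # no match beyond the mismatch
    l_start -= 2
    r_end += 2
    # Continue perfect extension past the tolerated mismatch
    while (l_start > 0 and r_end < len(genome)
           and COMPLEMENT.get(genome[r_end]) == genome[l_start - 1]):
        l_start -= 1
        r_end += 1
    return l_start, r_end


# Perfect IR spans [3, 15); A/A mismatch at positions 2 and 15, G/C pair beyond it
print(expand_imperfect_outward("TGAAACGTCCACGTTACG", 3, 15))  # (1, 17)
```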

In [None]:
# Ignore one mismatch, then search again outwards
IR_5mers_expand_imperfect = dict()
imp_len = 0
IR_5mers_expand_imperfect[imp_len] = IR_5mers_expand_in.copy()

IR_5mers_expand_imperfect[imp_len]['imp_seq_L'] = [reference_genome_masked_STR[chrom][start-2:start-1] for chrom, start in zip(IR_5mers_expand_imperfect[imp_len]['chrom'], IR_5mers_expand_imperfect[imp_len]['L_start'])]
IR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [reference_genome_masked_STR[chrom][end+1:end+2] for chrom, end in zip(IR_5mers_expand_imperfect[imp_len]['chrom'], IR_5mers_expand_imperfect[imp_len]['R_end'])]
IR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [reverse_complement(seq) for seq in IR_5mers_expand_imperfect[imp_len]['imp_seq_R']]
IR_5mers_expand_imperfect[imp_len+1] = IR_5mers_expand_imperfect[imp_len].loc[(IR_5mers_expand_imperfect[imp_len]['imp_seq_L'] == IR_5mers_expand_imperfect[imp_len]['imp_seq_R']) & (IR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') == 0) & (IR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') == 0)]
IR_5mers_expand_imperfect[imp_len] = IR_5mers_expand_imperfect[imp_len].loc[(IR_5mers_expand_imperfect[imp_len]['imp_seq_L'] != IR_5mers_expand_imperfect[imp_len]['imp_seq_R']) | (IR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') != 0) | (IR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') != 0)]
imp_len +=1
IR_5mers_expand_imperfect[imp_len]['L_start'] = IR_5mers_expand_imperfect[imp_len]['L_start'] - 2
IR_5mers_expand_imperfect[imp_len]['R_end'] = IR_5mers_expand_imperfect[imp_len]['R_end'] + 2

list_len = len(IR_5mers_expand_imperfect[imp_len])
while list_len > 0:
    IR_5mers_expand_imperfect[imp_len]['imp_seq_L'] = [reference_genome_masked_STR[chrom][start-1:start] for chrom, start in zip(IR_5mers_expand_imperfect[imp_len]['chrom'], IR_5mers_expand_imperfect[imp_len]['L_start'])]
    IR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [reference_genome_masked_STR[chrom][end:end+1] for chrom, end in zip(IR_5mers_expand_imperfect[imp_len]['chrom'], IR_5mers_expand_imperfect[imp_len]['R_end'])]
    IR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [reverse_complement(seq) for seq in IR_5mers_expand_imperfect[imp_len]['imp_seq_R']]
    IR_5mers_expand_imperfect[imp_len+1] = IR_5mers_expand_imperfect[imp_len].loc[(IR_5mers_expand_imperfect[imp_len]['imp_seq_L'] == IR_5mers_expand_imperfect[imp_len]['imp_seq_R']) & (IR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') == 0) & (IR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') == 0)]
    IR_5mers_expand_imperfect[imp_len] = IR_5mers_expand_imperfect[imp_len].loc[(IR_5mers_expand_imperfect[imp_len]['imp_seq_L'] != IR_5mers_expand_imperfect[imp_len]['imp_seq_R']) | (IR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') != 0) | (IR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') != 0)]
    imp_len +=1
    list_len = len(IR_5mers_expand_imperfect[imp_len])
    IR_5mers_expand_imperfect[imp_len]['L_start'] = IR_5mers_expand_imperfect[imp_len]['L_start'] - 1
    IR_5mers_expand_imperfect[imp_len]['R_end'] = IR_5mers_expand_imperfect[imp_len]['R_end'] + 1
    print('\r' + 'expanding out imp_len ' +str(imp_len), end='        ')

IR_5mers_expand_imperfect_out = pd.concat(IR_5mers_expand_imperfect).reset_index(drop = True)
IR_5mers_expand_imperfect_out['spacer'] = IR_5mers_expand_imperfect_out['R_start'] - IR_5mers_expand_imperfect_out['L_end']
IR_5mers_expand_imperfect_out = IR_5mers_expand_imperfect_out.loc[IR_5mers_expand_imperfect_out['spacer'] >= 0]

In [None]:
# Ignore one mismatch, then search again inwards, unless spacer is already 0
IR_5mers_expand_imperfect = dict()
imp_len = 0
IR_5mers_expand_imperfect[imp_len] = IR_5mers_expand_imperfect_out.copy()

IR_5mers_expand_imperfect[imp_len]['imp_seq_L'] = [reference_genome_masked_STR[chrom][end+1:end+2] for chrom, end in zip(IR_5mers_expand_imperfect[imp_len]['chrom'], IR_5mers_expand_imperfect[imp_len]['L_end'])]
IR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [reference_genome_masked_STR[chrom][start-2:start-1] for chrom, start in zip(IR_5mers_expand_imperfect[imp_len]['chrom'], IR_5mers_expand_imperfect[imp_len]['R_start'])]
IR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [reverse_complement(seq) for seq in IR_5mers_expand_imperfect[imp_len]['imp_seq_R']]
IR_5mers_expand_imperfect[imp_len+1] = IR_5mers_expand_imperfect[imp_len].loc[(IR_5mers_expand_imperfect[imp_len]['imp_seq_L'] == IR_5mers_expand_imperfect[imp_len]['imp_seq_R']) & (IR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') == 0) & (IR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') == 0) & (IR_5mers_expand_imperfect[imp_len]['spacer'] > 3)]
IR_5mers_expand_imperfect[imp_len] = IR_5mers_expand_imperfect[imp_len].loc[(IR_5mers_expand_imperfect[imp_len]['imp_seq_L'] != IR_5mers_expand_imperfect[imp_len]['imp_seq_R']) | (IR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') != 0) | (IR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') != 0) | (IR_5mers_expand_imperfect[imp_len]['spacer'] <= 3)]
imp_len +=1
# Advance both inner boundaries by 2: one skipped (mismatched) base plus one matched base per arm
IR_5mers_expand_imperfect[imp_len]['L_end'] = IR_5mers_expand_imperfect[imp_len]['L_end'] + 2
IR_5mers_expand_imperfect[imp_len]['R_start'] = IR_5mers_expand_imperfect[imp_len]['R_start'] - 2
IR_5mers_expand_imperfect[imp_len]['spacer'] = IR_5mers_expand_imperfect[imp_len]['spacer'] - 4

list_len = len(IR_5mers_expand_imperfect[imp_len])
while list_len > 0:
    IR_5mers_expand_imperfect[imp_len]['imp_seq_L'] = [reference_genome_masked_STR[chrom][end:end+1] for chrom, end in zip(IR_5mers_expand_imperfect[imp_len]['chrom'], IR_5mers_expand_imperfect[imp_len]['L_end'])]
    IR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [reference_genome_masked_STR[chrom][start-1:start] for chrom, start in zip(IR_5mers_expand_imperfect[imp_len]['chrom'], IR_5mers_expand_imperfect[imp_len]['R_start'])]
    IR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [reverse_complement(seq) for seq in IR_5mers_expand_imperfect[imp_len]['imp_seq_R']]
    IR_5mers_expand_imperfect[imp_len+1] = IR_5mers_expand_imperfect[imp_len].loc[(IR_5mers_expand_imperfect[imp_len]['imp_seq_L'] == IR_5mers_expand_imperfect[imp_len]['imp_seq_R']) & (IR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') == 0) & (IR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') == 0) & (IR_5mers_expand_imperfect[imp_len]['spacer'] > 1)]
    IR_5mers_expand_imperfect[imp_len] = IR_5mers_expand_imperfect[imp_len].loc[(IR_5mers_expand_imperfect[imp_len]['imp_seq_L'] != IR_5mers_expand_imperfect[imp_len]['imp_seq_R']) | (IR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') != 0) | (IR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') != 0) | (IR_5mers_expand_imperfect[imp_len]['spacer'] <= 1)]
    imp_len +=1
    list_len = len(IR_5mers_expand_imperfect[imp_len])
    IR_5mers_expand_imperfect[imp_len]['L_end'] = IR_5mers_expand_imperfect[imp_len]['L_end'] + 1
    IR_5mers_expand_imperfect[imp_len]['R_start'] = IR_5mers_expand_imperfect[imp_len]['R_start'] - 1
    IR_5mers_expand_imperfect[imp_len]['spacer'] = IR_5mers_expand_imperfect[imp_len]['spacer'] - 2
    print('\r' + 'expanding in imp_len ' +str(imp_len), end='        ')

IR_5mers_expand_imperfect = pd.concat(IR_5mers_expand_imperfect).reset_index(drop = True)

In [None]:
# Format dataframe and remove duplicates
IR_5mers_expand_imperfect = IR_5mers_expand_imperfect.drop_duplicates(subset = ['L_start', 'R_start', 'L_end', 'R_end']).copy()
IR_5mers_expand_imperfect['stem_len'] = IR_5mers_expand_imperfect['L_end'] - IR_5mers_expand_imperfect['L_start']
IR_expand_imperfect = IR_5mers_expand_imperfect.loc[IR_5mers_expand_imperfect['stem_len'] >9].copy()
IR_expand_imperfect['seq_L'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(IR_expand_imperfect['chrom'], IR_expand_imperfect['L_start'], IR_expand_imperfect['L_end'])]
IR_expand_imperfect['seq_R'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(IR_expand_imperfect['chrom'], IR_expand_imperfect['R_start'], IR_expand_imperfect['R_end'])]
IR_expand_imperfect['RC_seq_R'] = [reverse_complement(seq) for seq in IR_expand_imperfect['seq_R']]
IR_expand_imperfect['Sequence'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(IR_expand_imperfect['chrom'], IR_expand_imperfect['L_start'], IR_expand_imperfect['R_end'])]
IR_expand_imperfect['freq'] = IR_expand_imperfect.groupby('seq_L')['seq_L'].transform('count')  # frequency of stem sequence in database, ignoring spacer sequence
IR_expand_imperfect['spacer'] = IR_expand_imperfect['R_start'] - (IR_expand_imperfect['L_end'])
IR_expand_imperfect = IR_expand_imperfect.loc[IR_expand_imperfect['Sequence'].str.count('N') == 0]
IR_expand_imperfect = IR_expand_imperfect.sort_values(by = ['chrom', 'L_start', 'R_end']).reset_index(drop = True)
del IR_expand_imperfect['imp_seq_L']; del IR_expand_imperfect['imp_seq_R']
IR_expand_imperfect['#MM'] = [sum(1 for a, b in zip(seq1, seq2) if a != b) for seq1, seq2 in zip(IR_expand_imperfect['seq_L'], IR_expand_imperfect['RC_seq_R'])]
# For repeats sharing the same outer coordinates, keep only the entry with the longest stem
IR_expand_imperfect = IR_expand_imperfect.sort_values(by = ['chrom', 'L_start', 'R_end', 'stem_len']).reset_index(drop = True)
IR_expand_imperfect = IR_expand_imperfect.drop_duplicates(subset = ['chrom', 'L_start', 'R_end'], keep = 'last')

#### Save IR data <a name="DB_IR_save"></a>

[Return to Table of Contents](#TOC)
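
As a rough illustration of what each saved row encodes, the sketch below rebuilds `stem_len`, `spacer`, and `#MM` for one hand-made inverted repeat. The toy sequence and the helper `reverse_complement_sketch` are assumptions made for this example; the notebook's own `reverse_complement` is defined in the functions section and may differ in detail.

```python
# Hypothetical stand-in for the notebook's reverse_complement helper
_COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N'}

def reverse_complement_sketch(seq):
    return ''.join(_COMPLEMENT[base] for base in reversed(seq))

# Toy sequence (made up): 10-bp left arm, 3-bp spacer, 10-bp right arm
toy = 'GATTACAGGA' + 'TTT' + 'TCCTGTTATC'
L_start, L_end, R_start, R_end = 0, 10, 13, 23   # 0-based, end-exclusive

seq_L = toy[L_start:L_end]
RC_seq_R = reverse_complement_sketch(toy[R_start:R_end])

record = {
    'stem_len': L_end - L_start,      # length of one arm
    'spacer': R_start - L_end,        # bases between the two arms
    '#MM': sum(1 for a, b in zip(seq_L, RC_seq_R) if a != b),  # Hamming distance
}
print(record)
```

Here the right arm differs from a perfect reverse complement at a single position, so the record describes an imperfect inverted repeat with one mismatch.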

In [None]:
# Save IR database
IR_expand_imperfect.to_csv('./custom_db/inverted_repeats_withoutSTRs_imperfect_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', index = False)

## Mirror repeats <a name="DB_MR"></a>

#### Find all MR 5-mers <a name="DB_MR_5mer"></a>

[Return to Table of Contents](#TOC)
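
A mirror repeat pairs an N-mer with its plain reverse (not its reverse complement) further downstream. The seed search can be sketched on a toy string as follows. This is a simplified, loop-based illustration under stated assumptions: the function name `find_mirror_seeds` and the toy sequence are hypothetical, and only non-negative spacers are enumerated, whereas the notebook cell shifts a whole DataFrame column and filters overlapping hits afterwards.

```python
def find_mirror_seeds(seq, n=5, max_spacer=100):
    """Return (L_start, R_start, spacer) for each n-mer whose reverse
    occurs downstream with a spacer of 0..max_spacer bases."""
    seeds = []
    for l_start in range(len(seq) - n + 1):
        left = seq[l_start:l_start + n]
        if 'N' in left:                      # skip masked sequence
            continue
        for spacer in range(max_spacer + 1):
            r_start = l_start + n + spacer
            if r_start + n > len(seq):       # ran off the end of the sequence
                break
            if seq[r_start:r_start + n] == left[::-1]:
                seeds.append((l_start, r_start, spacer))
    return seeds

toy = 'ACGTATTTATGCA'   # made-up example: ACGTAT / T / TATGCA
seeds = find_mirror_seeds(toy)
print(seeds)
```

The same underlying repeat yields more than one 5-mer seed, which is why the later expansion steps drop duplicates.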

In [None]:
N = 5   # N-mer size
max_spacer = 100    # Max spacer size

In [None]:
rev_match = dict()
for chrom in range(chr_range,23):
    # Find all 5-mers in chromosome
    Nmer_sets_overlap = pd.DataFrame(re.findall('.'*N, reference_genome_masked_STR[chrom], overlapped = True), columns = ['seq'])
    Nmer_sets_overlap = Nmer_sets_overlap.loc[Nmer_sets_overlap['seq'].str.count('N') ==0]
    Nmer_sets_overlap['L_start'] = Nmer_sets_overlap.index
    Nmer_sets_overlap.reset_index(drop = True, inplace = True)
    # Get reverse sequence of each 5-mer
    Nmer_sets_overlap['seq_rev'] = [seq[::-1] for seq in Nmer_sets_overlap['seq']]
    Nmer_sets_overlap['R_start'] = Nmer_sets_overlap['L_start']
    Nmer_sets_overlap['spacer'] = -N
    rev_match[chrom] = dict()
    # For each spacer length, check if FWD 5-mer = REV 5-mer
    for n_shifts in range(max_spacer+N+1):
        rev_match[chrom][n_shifts] = Nmer_sets_overlap.loc[Nmer_sets_overlap['seq'] == Nmer_sets_overlap['seq_rev']]
        Nmer_sets_overlap['seq_rev'] = Nmer_sets_overlap['seq_rev'].shift(-1)
        Nmer_sets_overlap['R_start'] = Nmer_sets_overlap['R_start'].shift(-1)
        Nmer_sets_overlap['spacer'] = Nmer_sets_overlap['spacer'] + 1
        print('\r' + 'shifting spacer ' +str(n_shifts) + ' chr' + str(chrom), end='        ')
    rev_match[chrom] = pd.concat(rev_match[chrom])
    rev_match[chrom]['chrom'] = chrom
rev_match = pd.concat(rev_match)
rev_match.reset_index(drop = True, inplace = True)

In [None]:
del Nmer_sets_overlap
MR_5mers_expand = dict()
MR_5mers_expand[N] = rev_match.loc[rev_match['spacer'] >= 0][['chrom', 'L_start', 'R_start']].copy()
MR_5mers_expand[N]['R_start'] = MR_5mers_expand[N]['R_start'].astype(int)
del rev_match
# Save temporary file
MR_5mers_expand[N].to_csv('./custom_db/temp/MR_SL'+str(N)+'_SP'+str(max_spacer)+'_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', index = False)

#### Expand perfect MRs <a name="DB_MR_expand"></a>

[Return to Table of Contents](#TOC)
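
The outward-then-inward expansion performed in the cells below can be sketched for a single seed. This is a minimal per-record illustration, not the vectorised DataFrame code used in the notebook; `expand_mirror` and the toy sequence are hypothetical names introduced for the example, and coordinates are assumed to be 0-based and end-exclusive.

```python
def expand_mirror(seq, l_start, l_end, r_start, r_end):
    """Grow a perfect mirror repeat one base at a time."""
    # Outward: the base left of the left arm must equal the base right of the right arm
    while (l_start > 0 and r_end < len(seq)
           and seq[l_start - 1] == seq[r_end] and seq[r_end] != 'N'):
        l_start -= 1
        r_end += 1
    # Inward: only while at least two spacer bases remain to be consumed
    while (r_start - l_end >= 2
           and seq[l_end] == seq[r_start - 1] and seq[l_end] != 'N'):
        l_end += 1
        r_start -= 1
    return l_start, l_end, r_start, r_end

toy = 'ACGTATTTATGCA'                    # made-up example sequence
from_seed_a = expand_mirror(toy, 0, 5, 8, 13)   # seed ACGTA ... ATGCA
from_seed_b = expand_mirror(toy, 1, 6, 7, 12)   # seed CGTAT ... TATGC
print(from_seed_a, from_seed_b)
```

Both seeds converge on the same coordinates, so a duplicate-removal step on `L_start`, `L_end`, `R_start`, `R_end` leaves one record per repeat.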

In [None]:
# Load temporary file
stem_len = 5    # Initial N-mer size from above section
max_spacer = 100  # Max spacer size from above section
MR_5mers_expand = dict()
MR_5mers_expand[stem_len] = pd.read_csv('./custom_db/temp/MR_SL'+str(stem_len)+'_SP'+str(max_spacer)+'_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip')

In [None]:
# Derive arm end coordinates and spacer length for the initial 5-mer seeds
MR_5mers_expand[stem_len]['stem_len'] = stem_len
MR_5mers_expand[stem_len]['L_end'] = MR_5mers_expand[stem_len]['L_start'] + stem_len
MR_5mers_expand[stem_len]['R_end'] = MR_5mers_expand[stem_len]['R_start'] + stem_len
MR_5mers_expand[stem_len]['spacer'] = MR_5mers_expand[stem_len]['R_start'] - MR_5mers_expand[stem_len]['L_end']
MR_5mers_expand[stem_len] = MR_5mers_expand[stem_len].loc[MR_5mers_expand[stem_len]['spacer'] <= max_spacer]

# Continue MR search outwards from initial matches
list_len = len(MR_5mers_expand[stem_len])
while list_len > 0:
    MR_5mers_expand[stem_len]['seq_L'] = [reference_genome_masked_STR[chrom][start-1:end] for chrom, start, end in zip(MR_5mers_expand[stem_len]['chrom'], MR_5mers_expand[stem_len]['L_start'], MR_5mers_expand[stem_len]['L_end'])]
    MR_5mers_expand[stem_len]['seq_R'] = [reference_genome_masked_STR[chrom][start:end+1] for chrom, start, end in zip(MR_5mers_expand[stem_len]['chrom'], MR_5mers_expand[stem_len]['R_start'], MR_5mers_expand[stem_len]['R_end'])]
    MR_5mers_expand[stem_len]['seq_R'] = [seq[::-1] for seq in MR_5mers_expand[stem_len]['seq_R']]
    MR_5mers_expand[stem_len+1] = MR_5mers_expand[stem_len].loc[(MR_5mers_expand[stem_len]['seq_L'] == MR_5mers_expand[stem_len]['seq_R']) & (MR_5mers_expand[stem_len]['seq_L'].str.count('N') == 0) & (MR_5mers_expand[stem_len]['seq_R'].str.count('N') == 0)]
    MR_5mers_expand[stem_len] = MR_5mers_expand[stem_len].loc[(MR_5mers_expand[stem_len]['seq_L'] != MR_5mers_expand[stem_len]['seq_R']) | (MR_5mers_expand[stem_len]['seq_L'].str.count('N') != 0) | (MR_5mers_expand[stem_len]['seq_R'].str.count('N') != 0)]
    stem_len +=1
    list_len = len(MR_5mers_expand[stem_len])
    MR_5mers_expand[stem_len]['L_start'] = MR_5mers_expand[stem_len]['L_start'] - 1
    MR_5mers_expand[stem_len]['R_end'] = MR_5mers_expand[stem_len]['R_end'] + 1
    MR_5mers_expand[stem_len]['stem_len'] = stem_len
    print('\r' + 'expanding out stem_len ' +str(stem_len), end='        ')

MR_5mers_expand = pd.concat(MR_5mers_expand)
del MR_5mers_expand['seq_L']; del MR_5mers_expand['seq_R']
# remove duplicates and MRs that expand into masked regions
MR_5mers_expand = MR_5mers_expand.drop_duplicates(subset = ['L_start', 'L_end', 'R_start', 'R_end'], keep = 'first')
MR_5mers_expand['Sequence'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(MR_5mers_expand['chrom'], MR_5mers_expand['L_start'], MR_5mers_expand['R_end'])]
MR_5mers_expand = MR_5mers_expand.loc[MR_5mers_expand['Sequence'].str.count('N') == 0]

In [None]:
# Continue MR search inward, unless spacer length is already 0
MR_5mers_expand_in = dict()
in_len = 0
MR_5mers_expand_in[in_len] = MR_5mers_expand.copy()
list_len = len(MR_5mers_expand_in[in_len])
while list_len > 0:
    MR_5mers_expand_in[in_len]['seq_L'] = [reference_genome_masked_STR[chrom][start:end+1] for chrom, start, end in zip(MR_5mers_expand_in[in_len]['chrom'], MR_5mers_expand_in[in_len]['L_start'], MR_5mers_expand_in[in_len]['L_end'])]
    MR_5mers_expand_in[in_len]['seq_R'] = [reference_genome_masked_STR[chrom][start-1:end] for chrom, start, end in zip(MR_5mers_expand_in[in_len]['chrom'], MR_5mers_expand_in[in_len]['R_start'], MR_5mers_expand_in[in_len]['R_end'])]
    MR_5mers_expand_in[in_len]['seq_R'] = [seq[::-1] for seq in MR_5mers_expand_in[in_len]['seq_R']]
    MR_5mers_expand_in[in_len+1] = MR_5mers_expand_in[in_len].loc[(MR_5mers_expand_in[in_len]['seq_L'] == MR_5mers_expand_in[in_len]['seq_R']) & (MR_5mers_expand_in[in_len]['seq_L'].str.count('N') == 0) & (MR_5mers_expand_in[in_len]['seq_R'].str.count('N') == 0) & (MR_5mers_expand_in[in_len]['spacer'] > 1)]
    MR_5mers_expand_in[in_len] = MR_5mers_expand_in[in_len].loc[(MR_5mers_expand_in[in_len]['seq_L'] != MR_5mers_expand_in[in_len]['seq_R']) | (MR_5mers_expand_in[in_len]['seq_L'].str.count('N') != 0) | (MR_5mers_expand_in[in_len]['seq_R'].str.count('N') != 0) | (MR_5mers_expand_in[in_len]['spacer'] <= 1)]
    in_len +=1
    list_len = len(MR_5mers_expand_in[in_len])
    MR_5mers_expand_in[in_len]['R_start'] = MR_5mers_expand_in[in_len]['R_start'] - 1
    MR_5mers_expand_in[in_len]['L_end'] = MR_5mers_expand_in[in_len]['L_end'] + 1
    MR_5mers_expand_in[in_len]['spacer'] = MR_5mers_expand_in[in_len]['spacer'] -2
    print('\r' + 'expanding in ' +str(in_len), end='        ')

MR_5mers_expand_in = pd.concat(MR_5mers_expand_in).reset_index(drop = True)
del MR_5mers_expand_in['seq_L']; del MR_5mers_expand_in['seq_R']

MR_5mers_expand_in = MR_5mers_expand_in.drop_duplicates(subset = ['L_start', 'L_end', 'R_start', 'R_end'], keep = 'first')
MR_5mers_expand_in['Sequence'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(MR_5mers_expand_in['chrom'], MR_5mers_expand_in['L_start'], MR_5mers_expand_in['R_end'])]
MR_5mers_expand_in = MR_5mers_expand_in.loc[MR_5mers_expand_in['Sequence'].str.count('N') == 0]

#### Expand imperfect MRs <a name="DB_MR_expand_imperfect"></a>

[Return to Table of Contents](#TOC)
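
Allowing one mismatch means skipping the pair of bases that stopped the perfect expansion and testing the next pair out. The sketch below shows the outward case for one record; `expand_past_one_mismatch`, `count_mismatches`, and the toy sequence are hypothetical and serve only to illustrate the idea, with explicit bounds checks added as an assumption.

```python
def expand_past_one_mismatch(seq, l_start, r_end):
    """Skip one base on each side; if the next pair matches, keep growing outward."""
    can_skip = l_start >= 2 and r_end + 1 < len(seq)
    if can_skip and seq[l_start - 2] == seq[r_end + 1] and seq[l_start - 2] != 'N':
        l_start -= 2          # skipped (mismatched) base + matched base
        r_end += 2
        while (l_start > 0 and r_end < len(seq)
               and seq[l_start - 1] == seq[r_end] and seq[r_end] != 'N'):
            l_start -= 1
            r_end += 1
    return l_start, r_end

def count_mismatches(left_arm, right_arm):
    """Hamming distance between the left arm and the reversed right arm."""
    return sum(1 for a, b in zip(left_arm, right_arm[::-1]) if a != b)

toy = 'TGACGTATTTATGCACT'     # made-up: perfect MR at [2:8] and [9:15]
L_end, R_start = 8, 9
new_L_start, new_R_end = expand_past_one_mismatch(toy, 2, 15)
n_mm = count_mismatches(toy[new_L_start:L_end], toy[R_start:new_R_end])
print(new_L_start, new_R_end, n_mm)
```

The expanded arms are two bases longer on each side and carry exactly one mismatch, the pair that was skipped.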

In [None]:
# Ignore one mismatch, then search again outwards
MR_5mers_expand_imperfect = dict()
imp_len = 0
MR_5mers_expand_imperfect[imp_len] = MR_5mers_expand_in.copy()

MR_5mers_expand_imperfect[imp_len]['imp_seq_L'] = [reference_genome_masked_STR[chrom][start-2:start-1] for chrom, start in zip(MR_5mers_expand_imperfect[imp_len]['chrom'], MR_5mers_expand_imperfect[imp_len]['L_start'])]
MR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [reference_genome_masked_STR[chrom][end+1:end+2] for chrom, end in zip(MR_5mers_expand_imperfect[imp_len]['chrom'], MR_5mers_expand_imperfect[imp_len]['R_end'])]
MR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [seq[::-1] for seq in MR_5mers_expand_imperfect[imp_len]['imp_seq_R']]
MR_5mers_expand_imperfect[imp_len+1] = MR_5mers_expand_imperfect[imp_len].loc[(MR_5mers_expand_imperfect[imp_len]['imp_seq_L'] == MR_5mers_expand_imperfect[imp_len]['imp_seq_R']) & (MR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') == 0) & (MR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') == 0)]
MR_5mers_expand_imperfect[imp_len] = MR_5mers_expand_imperfect[imp_len].loc[(MR_5mers_expand_imperfect[imp_len]['imp_seq_L'] != MR_5mers_expand_imperfect[imp_len]['imp_seq_R']) | (MR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') != 0) | (MR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') != 0)]
imp_len +=1
MR_5mers_expand_imperfect[imp_len]['L_start'] = MR_5mers_expand_imperfect[imp_len]['L_start'] - 2
MR_5mers_expand_imperfect[imp_len]['R_end'] = MR_5mers_expand_imperfect[imp_len]['R_end'] + 2

list_len = len(MR_5mers_expand_imperfect[imp_len])
while list_len > 0:
    MR_5mers_expand_imperfect[imp_len]['imp_seq_L'] = [reference_genome_masked_STR[chrom][start-1:start] for chrom, start in zip(MR_5mers_expand_imperfect[imp_len]['chrom'], MR_5mers_expand_imperfect[imp_len]['L_start'])]
    MR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [reference_genome_masked_STR[chrom][end:end+1] for chrom, end in zip(MR_5mers_expand_imperfect[imp_len]['chrom'], MR_5mers_expand_imperfect[imp_len]['R_end'])]
    MR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [seq[::-1] for seq in MR_5mers_expand_imperfect[imp_len]['imp_seq_R']]
    MR_5mers_expand_imperfect[imp_len+1] = MR_5mers_expand_imperfect[imp_len].loc[(MR_5mers_expand_imperfect[imp_len]['imp_seq_L'] == MR_5mers_expand_imperfect[imp_len]['imp_seq_R']) & (MR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') == 0) & (MR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') == 0)]
    MR_5mers_expand_imperfect[imp_len] = MR_5mers_expand_imperfect[imp_len].loc[(MR_5mers_expand_imperfect[imp_len]['imp_seq_L'] != MR_5mers_expand_imperfect[imp_len]['imp_seq_R']) | (MR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') != 0) | (MR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') != 0)]
    imp_len +=1
    list_len = len(MR_5mers_expand_imperfect[imp_len])
    MR_5mers_expand_imperfect[imp_len]['L_start'] = MR_5mers_expand_imperfect[imp_len]['L_start'] - 1
    MR_5mers_expand_imperfect[imp_len]['R_end'] = MR_5mers_expand_imperfect[imp_len]['R_end'] + 1
    print('\r' + 'expanding out imp_len ' +str(imp_len), end='        ')

MR_5mers_expand_imperfect_out = pd.concat(MR_5mers_expand_imperfect).reset_index(drop = True)
MR_5mers_expand_imperfect_out['spacer'] = MR_5mers_expand_imperfect_out['R_start'] - MR_5mers_expand_imperfect_out['L_end']
MR_5mers_expand_imperfect_out = MR_5mers_expand_imperfect_out.loc[MR_5mers_expand_imperfect_out['spacer'] >= 0]

MR_5mers_expand_imperfect_out['seq_L'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(MR_5mers_expand_imperfect_out['chrom'], MR_5mers_expand_imperfect_out['L_start'], MR_5mers_expand_imperfect_out['L_end'])]
MR_5mers_expand_imperfect_out['seq_R'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(MR_5mers_expand_imperfect_out['chrom'], MR_5mers_expand_imperfect_out['R_start'], MR_5mers_expand_imperfect_out['R_end'])]
MR_5mers_expand_imperfect_out['rev_seq_R'] = [seq[::-1] for seq in MR_5mers_expand_imperfect_out['seq_R']]
MR_5mers_expand_imperfect_out['#MM'] = [sum(1 for a, b in zip(seq1, seq2) if a != b) for seq1, seq2 in zip(MR_5mers_expand_imperfect_out['seq_L'], MR_5mers_expand_imperfect_out['rev_seq_R'])]

In [None]:
# Ignore one mismatch, then search again inwards, unless spacer is already 0
MR_5mers_expand_imperfect = dict()
imp_len = 0
MR_5mers_expand_imperfect[imp_len] = MR_5mers_expand_imperfect_out.copy()
MR_5mers_expand_imperfect[imp_len]['imp_seq_L'] = [reference_genome_masked_STR[chrom][end+1:end+2] for chrom, end in zip(MR_5mers_expand_imperfect[imp_len]['chrom'], MR_5mers_expand_imperfect[imp_len]['L_end'])]
MR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [reference_genome_masked_STR[chrom][start-2:start-1] for chrom, start in zip(MR_5mers_expand_imperfect[imp_len]['chrom'], MR_5mers_expand_imperfect[imp_len]['R_start'])]
MR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [seq[::-1] for seq in MR_5mers_expand_imperfect[imp_len]['imp_seq_R']]
MR_5mers_expand_imperfect[imp_len+1] = MR_5mers_expand_imperfect[imp_len].loc[(MR_5mers_expand_imperfect[imp_len]['imp_seq_L'] == MR_5mers_expand_imperfect[imp_len]['imp_seq_R']) & (MR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') == 0) & (MR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') == 0) & (MR_5mers_expand_imperfect[imp_len]['spacer'] > 3)]
MR_5mers_expand_imperfect[imp_len] = MR_5mers_expand_imperfect[imp_len].loc[(MR_5mers_expand_imperfect[imp_len]['imp_seq_L'] != MR_5mers_expand_imperfect[imp_len]['imp_seq_R']) | (MR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') != 0) | (MR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') != 0) | (MR_5mers_expand_imperfect[imp_len]['spacer'] <= 3)]
imp_len +=1
MR_5mers_expand_imperfect[imp_len]['L_end'] = MR_5mers_expand_imperfect[imp_len]['L_end'] + 2
MR_5mers_expand_imperfect[imp_len]['R_start'] = MR_5mers_expand_imperfect[imp_len]['R_start'] - 2
MR_5mers_expand_imperfect[imp_len]['spacer'] = MR_5mers_expand_imperfect[imp_len]['spacer'] - 4

list_len = len(MR_5mers_expand_imperfect[imp_len])
while list_len > 0:
    MR_5mers_expand_imperfect[imp_len]['imp_seq_L'] = [reference_genome_masked_STR[chrom][end:end+1] for chrom, end in zip(MR_5mers_expand_imperfect[imp_len]['chrom'], MR_5mers_expand_imperfect[imp_len]['L_end'])]
    MR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [reference_genome_masked_STR[chrom][start-1:start] for chrom, start in zip(MR_5mers_expand_imperfect[imp_len]['chrom'], MR_5mers_expand_imperfect[imp_len]['R_start'])]
    MR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [seq[::-1] for seq in MR_5mers_expand_imperfect[imp_len]['imp_seq_R']]
    MR_5mers_expand_imperfect[imp_len+1] = MR_5mers_expand_imperfect[imp_len].loc[(MR_5mers_expand_imperfect[imp_len]['imp_seq_L'] == MR_5mers_expand_imperfect[imp_len]['imp_seq_R']) & (MR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') == 0) & (MR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') == 0) & (MR_5mers_expand_imperfect[imp_len]['spacer'] > 1)]
    MR_5mers_expand_imperfect[imp_len] = MR_5mers_expand_imperfect[imp_len].loc[(MR_5mers_expand_imperfect[imp_len]['imp_seq_L'] != MR_5mers_expand_imperfect[imp_len]['imp_seq_R']) | (MR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') != 0) | (MR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') != 0) | (MR_5mers_expand_imperfect[imp_len]['spacer'] <= 1)]
    imp_len +=1
    list_len = len(MR_5mers_expand_imperfect[imp_len])
    MR_5mers_expand_imperfect[imp_len]['L_end'] = MR_5mers_expand_imperfect[imp_len]['L_end'] + 1
    MR_5mers_expand_imperfect[imp_len]['R_start'] = MR_5mers_expand_imperfect[imp_len]['R_start'] - 1
    MR_5mers_expand_imperfect[imp_len]['spacer'] = MR_5mers_expand_imperfect[imp_len]['spacer'] - 2
    print('\r' + 'expanding in imp_len ' +str(imp_len), end='        ')

In [None]:
# Format dataframe and remove duplicates
MR_5mers_expand_imperfect = pd.concat(MR_5mers_expand_imperfect).reset_index(drop = True)
MR_5mers_expand_imperfect = MR_5mers_expand_imperfect.drop_duplicates(subset = ['L_start', 'R_start', 'L_end', 'R_end'])
MR_5mers_expand_imperfect['stem_len'] = MR_5mers_expand_imperfect['L_end'] - MR_5mers_expand_imperfect['L_start']
MR_expand_imperfect = MR_5mers_expand_imperfect.loc[MR_5mers_expand_imperfect['stem_len'] >9].copy()
MR_expand_imperfect['seq_L'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(MR_expand_imperfect['chrom'], MR_expand_imperfect['L_start'], MR_expand_imperfect['L_end'])]
MR_expand_imperfect['seq_R'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(MR_expand_imperfect['chrom'], MR_expand_imperfect['R_start'], MR_expand_imperfect['R_end'])]
MR_expand_imperfect['rev_seq_R'] = [seq[::-1] for seq in MR_expand_imperfect['seq_R']]
MR_expand_imperfect['Sequence'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(MR_expand_imperfect['chrom'], MR_expand_imperfect['L_start'], MR_expand_imperfect['R_end'])]
MR_expand_imperfect['freq'] = MR_expand_imperfect.groupby('seq_L')['seq_L'].transform('count')
MR_expand_imperfect['spacer'] = MR_expand_imperfect['R_start'] - (MR_expand_imperfect['L_end'])
MR_expand_imperfect = MR_expand_imperfect.loc[MR_expand_imperfect['Sequence'].str.count('N') == 0]
MR_expand_imperfect = MR_expand_imperfect.sort_values(by = ['chrom', 'L_start', 'R_end']).reset_index(drop = True)
del MR_expand_imperfect['imp_seq_L']; del MR_expand_imperfect['imp_seq_R']
MR_expand_imperfect['#MM'] = [sum(1 for a, b in zip(seq1, seq2) if a != b) for seq1, seq2 in zip(MR_expand_imperfect['seq_L'], MR_expand_imperfect['rev_seq_R'])]
# For repeats sharing the same outer coordinates, keep only the entry with the longest stem
MR_expand_imperfect = MR_expand_imperfect.sort_values(by = ['chrom', 'L_start', 'R_end', 'stem_len']).reset_index(drop = True)
MR_expand_imperfect = MR_expand_imperfect.drop_duplicates(subset = ['chrom', 'L_start', 'R_end'], keep = 'last')

#### Save MR data <a name="DB_MR_save"></a>

[Return to Table of Contents](#TOC)

In [None]:
# Save MR database
MR_expand_imperfect.to_csv('./custom_db/mirror_repeats_withoutSTRs_imperfect_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', index = False)

## Direct repeats <a name="DB_DR"></a>

#### Find all perfect DRs, starting from 5-mers and expanding left and right <a name="DB_DR_perfect"></a>

[Return to Table of Contents](#TOC)
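
The seed step for direct repeats pairs every occurrence of a 5-mer with later occurrences of the same 5-mer. A minimal sketch of that pairing, assuming a plain dictionary of positions instead of the sorted and shifted DataFrame used in the cell below (`find_direct_seeds` and the toy sequence are hypothetical):

```python
from collections import defaultdict

def find_direct_seeds(seq, n=5, max_spacer=1000):
    """Return sorted (L_start, R_start, spacer) for every pair of identical n-mers."""
    positions = defaultdict(list)
    for i in range(len(seq) - n + 1):
        kmer = seq[i:i + n]
        if 'N' not in kmer:                 # ignore masked sequence
            positions[kmer].append(i)

    seeds = []
    for starts in positions.values():       # starts are already in ascending order
        for idx, l_start in enumerate(starts):
            for r_start in starts[idx + 1:]:
                spacer = r_start - (l_start + n)   # negative when the arms overlap
                if abs(spacer) >= max_spacer:
                    break
                seeds.append((l_start, r_start, spacer))
    return sorted(seeds)

toy = 'ACGTAGGACGTA'      # made-up: ACGTA / GG / ACGTA
direct_seeds = find_direct_seeds(toy)
print(direct_seeds)
```

As in the notebook cell, overlapping arms (negative spacer) are allowed at this stage and are dealt with later.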

In [None]:
N = 5   # N-mer size
max_spacer = 1000    # Seed spacer cutoff of 1000, leaving excess room for the expansion algorithm to find stem_len up to 100 with spacer up to 100

In [None]:
DR_5mers_expand = dict()
for chrom in range(chr_range,23):
    # Find every possible 5mer in the chromosome
    Nmer_sets_overlap = pd.DataFrame(re.findall('.'*N, reference_genome_masked_STR[chrom], overlapped = True), columns = ['seq'])
    print('\r' + 'gathered all chr' + str(chrom), end='           ')
    Nmer_sets_overlap['L_start'] = Nmer_sets_overlap.index
    Nmer_sets_overlap = Nmer_sets_overlap.loc[Nmer_sets_overlap['seq'].str.count('N') == 0]

    # Filter to those that appear twice or more
    Nmer_sets_overlap['freq'] = Nmer_sets_overlap.groupby('seq')['seq'].transform('count')
    Nmer_repeated = Nmer_sets_overlap.loc[Nmer_sets_overlap['freq'] >1]
    
    # Sort by 5mer sequence and starting position, then calculate spacer distance between every pair
    # Allow for overlapping direct repeats in initial search
    Nmer_repeated = Nmer_repeated.sort_values(by = ['seq', 'L_start'])
    Nmer_repeated['L_end'] = Nmer_repeated['L_start'] + N
    Nmer_filtered = dict()
    n_shifts = 0
    check_len = len(Nmer_repeated)
    while check_len >0:
        n_shifts += 1
        Nmer_repeated['R_start'] = Nmer_repeated['L_start'].shift(-n_shifts)
        Nmer_repeated['R_end'] = Nmer_repeated['L_end'].shift(-n_shifts)
        Nmer_repeated['spacer'] = Nmer_repeated['R_start'] - (Nmer_repeated['L_end'])
        Nmer_repeated['R_seq'] = Nmer_repeated['seq'].shift(-n_shifts)
        Nmer_repeated['n_shifts'] = n_shifts
        Nmer_filtered[n_shifts] = Nmer_repeated.loc[(Nmer_repeated['spacer'].abs() <max_spacer) & (Nmer_repeated['seq'] == Nmer_repeated['R_seq'])]
        Nmer_repeated = Nmer_repeated.loc[(Nmer_repeated['spacer'].abs() <max_spacer) & (Nmer_repeated['seq'] == Nmer_repeated['R_seq'])]
        check_len = len(Nmer_repeated)
        print('\r' + str(n_shifts) + ' shifts ' + str(check_len) + ' found - chr' + str(chrom), end='        ')

    Nmer_filtered = pd.concat(Nmer_filtered).sort_values(by = ['L_start', 'R_end']).reset_index(drop = True)
    del Nmer_filtered['R_seq']
    Nmer_filtered['stem_len'] = N

    DR_5mers_expand[chrom] = dict()
    stem_len = 5
    DR_5mers_expand[chrom][stem_len] = Nmer_filtered.copy()
    DR_5mers_expand[chrom][stem_len]['R_start'] = DR_5mers_expand[chrom][stem_len]['R_start'].astype(int)
    DR_5mers_expand[chrom][stem_len]['R_end'] = DR_5mers_expand[chrom][stem_len]['R_end'].astype(int)
    #Search left
    list_len = len(DR_5mers_expand[chrom][stem_len])
    while list_len > 0:
        DR_5mers_expand[chrom][stem_len]['seq_L'] = [reference_genome_masked_STR[chrom][start-1:end] for start, end in zip(DR_5mers_expand[chrom][stem_len]['L_start'], DR_5mers_expand[chrom][stem_len]['L_end'])]
        DR_5mers_expand[chrom][stem_len]['seq_R'] = [reference_genome_masked_STR[chrom][start-1:end] for start, end in zip(DR_5mers_expand[chrom][stem_len]['R_start'], DR_5mers_expand[chrom][stem_len]['R_end'])]

        DR_5mers_expand[chrom][stem_len+1] = DR_5mers_expand[chrom][stem_len].loc[(DR_5mers_expand[chrom][stem_len]['seq_L'] == DR_5mers_expand[chrom][stem_len]['seq_R']) & (DR_5mers_expand[chrom][stem_len]['seq_L'].str.count('N') == 0) & (DR_5mers_expand[chrom][stem_len]['seq_R'].str.count('N') == 0)]
        DR_5mers_expand[chrom][stem_len] = DR_5mers_expand[chrom][stem_len].loc[(DR_5mers_expand[chrom][stem_len]['seq_L'] != DR_5mers_expand[chrom][stem_len]['seq_R']) | (DR_5mers_expand[chrom][stem_len]['seq_L'].str.count('N') != 0) | (DR_5mers_expand[chrom][stem_len]['seq_R'].str.count('N') != 0)]

        stem_len +=1
        list_len = len(DR_5mers_expand[chrom][stem_len])
        DR_5mers_expand[chrom][stem_len]['L_start'] = DR_5mers_expand[chrom][stem_len]['L_start'] - 1
        DR_5mers_expand[chrom][stem_len]['R_start'] = DR_5mers_expand[chrom][stem_len]['R_start'] - 1
        DR_5mers_expand[chrom][stem_len]['stem_len'] = stem_len
        print('\r' + 'expanding left stem_len ' +str(stem_len) + ' chr' + str(chrom), end='        ')

    #Search right
    stem_len = 5
    list_len = len(DR_5mers_expand[chrom][stem_len])
    while list_len > 0:
        DR_5mers_expand[chrom][stem_len]['seq_L'] = [reference_genome_masked_STR[chrom][start:end+1] for start, end in zip(DR_5mers_expand[chrom][stem_len]['L_start'], DR_5mers_expand[chrom][stem_len]['L_end'])]
        DR_5mers_expand[chrom][stem_len]['seq_R'] = [reference_genome_masked_STR[chrom][start:end+1] for start, end in zip(DR_5mers_expand[chrom][stem_len]['R_start'], DR_5mers_expand[chrom][stem_len]['R_end'])]

        # [stem_len+1] is already populated from left search, must change positions first, then concat
        move_pos = DR_5mers_expand[chrom][stem_len].loc[(DR_5mers_expand[chrom][stem_len]['seq_L'] == DR_5mers_expand[chrom][stem_len]['seq_R']) & (DR_5mers_expand[chrom][stem_len]['seq_L'].str.count('N') == 0) & (DR_5mers_expand[chrom][stem_len]['seq_R'].str.count('N') == 0)].copy()
        move_pos['L_end'] = move_pos['L_end'] + 1
        move_pos['R_end'] = move_pos['R_end'] + 1
        move_pos['stem_len'] = stem_len+1
        if stem_len+1 in DR_5mers_expand[chrom].keys():
            DR_5mers_expand[chrom][stem_len+1] = pd.concat([DR_5mers_expand[chrom][stem_len+1], move_pos])
        else:
            DR_5mers_expand[chrom][stem_len+1] = move_pos
        DR_5mers_expand[chrom][stem_len] = DR_5mers_expand[chrom][stem_len].loc[(DR_5mers_expand[chrom][stem_len]['seq_L'] != DR_5mers_expand[chrom][stem_len]['seq_R']) | (DR_5mers_expand[chrom][stem_len]['seq_L'].str.count('N') != 0) | (DR_5mers_expand[chrom][stem_len]['seq_R'].str.count('N') != 0)]

        stem_len +=1
        list_len = len(DR_5mers_expand[chrom][stem_len])
        print('\r' + 'expanding right stem_len ' +str(stem_len) + ' chr' + str(chrom), end='        ')

    DR_5mers_expand[chrom] = pd.concat(DR_5mers_expand[chrom])
    del DR_5mers_expand[chrom]['seq_L']; del DR_5mers_expand[chrom]['seq_R']
    
    # Sorting by n_shifts keeps longer stem_len first in list, allowing drop_duplicates(keep = 'first')
    DR_5mers_expand[chrom] = DR_5mers_expand[chrom].sort_values(by = ['L_start', 'R_end', 'n_shifts']).reset_index(drop = True)
    DR_5mers_expand[chrom]['Sequence'] = [reference_genome_masked_STR[chrom][start:end] for start, end in zip(DR_5mers_expand[chrom]['L_start'], DR_5mers_expand[chrom]['R_end'])]
    DR_5mers_expand[chrom] = DR_5mers_expand[chrom].loc[DR_5mers_expand[chrom]['Sequence'].str.count('N') == 0]
    DR_5mers_expand[chrom] = DR_5mers_expand[chrom].drop_duplicates(subset = ['L_start', 'Sequence'], keep = 'first')
    
    DR_5mers_expand[chrom][['L_start', 'L_end', 'R_start', 'R_end']].to_csv('./custom_db/temp/DR_SL'+str(N)+'_SP'+str(max_spacer)+'_expanded_chrom'+str(chrom)+'.csv.gz', compression = 'gzip', index = False)
    
    print('\r' + 'finished chr' + str(chrom), end='        ')

#### Search for imperfect DRs, expanding left and right <a name="DB_DR_imperfect"></a>

[Return to Table of Contents](#TOC)
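
For direct repeats both arms grow in the same direction, so a leftward expansion moves `L_start` and `R_start` together and shrinks the spacer by one base per step. The sketch below illustrates the left-hand half of the procedure for one record, under the assumption of 0-based, end-exclusive coordinates; `expand_direct_left_one_mismatch` and the toy sequence are hypothetical.

```python
def expand_direct_left_one_mismatch(seq, l_start, l_end, r_start):
    """Skip one base left of each arm; if the next pair matches, keep growing left."""
    spacer = r_start - l_end
    if (l_start >= 2 and spacer > 1
            and seq[l_start - 2] == seq[r_start - 2] and seq[l_start - 2] != 'N'):
        l_start -= 2          # skipped (mismatched) base + matched base
        r_start -= 2
        spacer -= 2           # the right arm grows into the spacer
        while (l_start > 0 and spacer > 0
               and seq[l_start - 1] == seq[r_start - 1] and seq[l_start - 1] != 'N'):
            l_start -= 1
            r_start -= 1
            spacer -= 1
    return l_start, r_start, spacer

toy = 'TGACGTA' + 'GGGG' + 'TCACGTA'   # made-up: arms differ only at the second base
result = expand_direct_left_one_mismatch(toy, l_start=2, l_end=7, r_start=13)
print(result)
```

The perfect repeat `ACGTA` is extended by two bases on the left of each arm, and the spacer drops from 6 to 4.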

In [None]:
# Allow one imperfection, then expand again left and right
def DR_expand_imperfect_L_R(chrom):
    DR_expand_combined = pd.read_csv('./custom_db/temp/DR_SL'+str(N)+'_SP'+str(max_spacer)+'_expanded_chrom'+str(chrom)+'.csv.gz', compression = 'gzip')

    DR_expand_combined['chrom'] = chrom
    DR_expand_combined['spacer'] = DR_expand_combined['R_start'] - (DR_expand_combined['L_end'])

    DR_5mers_expand_imperfect = dict()
    imp_len = 0
    DR_5mers_expand_imperfect[imp_len] = DR_expand_combined.copy()

    DR_5mers_expand_imperfect[imp_len]['imp_seq_L'] = [reference_genome_masked_STR[chrom][start-2:start-1] for chrom, start in zip(DR_5mers_expand_imperfect[imp_len]['chrom'], DR_5mers_expand_imperfect[imp_len]['L_start'])]
    DR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [reference_genome_masked_STR[chrom][start-2:start-1] for chrom, start in zip(DR_5mers_expand_imperfect[imp_len]['chrom'], DR_5mers_expand_imperfect[imp_len]['R_start'])]

    DR_5mers_expand_imperfect[imp_len+1] = DR_5mers_expand_imperfect[imp_len].loc[(DR_5mers_expand_imperfect[imp_len]['imp_seq_L'] == DR_5mers_expand_imperfect[imp_len]['imp_seq_R']) & (DR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') == 0) & (DR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') == 0) & (DR_5mers_expand_imperfect[imp_len]['spacer'] > 1)]
    DR_5mers_expand_imperfect[imp_len] = DR_5mers_expand_imperfect[imp_len].loc[(DR_5mers_expand_imperfect[imp_len]['imp_seq_L'] != DR_5mers_expand_imperfect[imp_len]['imp_seq_R']) | (DR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') != 0) | (DR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') != 0) | (DR_5mers_expand_imperfect[imp_len]['spacer'] <= 1)]

    imp_len +=1
    # Shift both arm starts left by 2 (skipped base + matched base); the spacer shrinks by 2 because R_start moves while L_end stays fixed
    DR_5mers_expand_imperfect[imp_len]['L_start'] = DR_5mers_expand_imperfect[imp_len]['L_start'] - 2
    DR_5mers_expand_imperfect[imp_len]['R_start'] = DR_5mers_expand_imperfect[imp_len]['R_start'] - 2
    DR_5mers_expand_imperfect[imp_len]['spacer'] = DR_5mers_expand_imperfect[imp_len]['spacer'] - 2


    list_len = len(DR_5mers_expand_imperfect[imp_len])
    while list_len > 0:
        DR_5mers_expand_imperfect[imp_len]['imp_seq_L'] = [reference_genome_masked_STR[chrom][start-1:start] for chrom, start in zip(DR_5mers_expand_imperfect[imp_len]['chrom'], DR_5mers_expand_imperfect[imp_len]['L_start'])]
        DR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [reference_genome_masked_STR[chrom][start-1:start] for chrom, start in zip(DR_5mers_expand_imperfect[imp_len]['chrom'], DR_5mers_expand_imperfect[imp_len]['R_start'])]

        DR_5mers_expand_imperfect[imp_len+1] = DR_5mers_expand_imperfect[imp_len].loc[(DR_5mers_expand_imperfect[imp_len]['imp_seq_L'] == DR_5mers_expand_imperfect[imp_len]['imp_seq_R']) & (DR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') == 0) & (DR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') == 0) & (DR_5mers_expand_imperfect[imp_len]['spacer'] > 0)]
        DR_5mers_expand_imperfect[imp_len] = DR_5mers_expand_imperfect[imp_len].loc[(DR_5mers_expand_imperfect[imp_len]['imp_seq_L'] != DR_5mers_expand_imperfect[imp_len]['imp_seq_R']) | (DR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') != 0) | (DR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') != 0) | (DR_5mers_expand_imperfect[imp_len]['spacer'] <= 0)]

        imp_len +=1
        list_len = len(DR_5mers_expand_imperfect[imp_len])
        DR_5mers_expand_imperfect[imp_len]['L_start'] = DR_5mers_expand_imperfect[imp_len]['L_start'] - 1
        DR_5mers_expand_imperfect[imp_len]['R_start'] = DR_5mers_expand_imperfect[imp_len]['R_start'] - 1
        DR_5mers_expand_imperfect[imp_len]['spacer'] = DR_5mers_expand_imperfect[imp_len]['spacer'] - 1

        print('\r' + 'chrom ' + str(chrom) + ' expanding left imp_len ' +str(imp_len), end='        ')

    DR_5mers_expand_imperfect_out = pd.concat(DR_5mers_expand_imperfect).reset_index(drop = True)
    DR_5mers_expand_imperfect_out = DR_5mers_expand_imperfect_out.loc[DR_5mers_expand_imperfect_out['spacer'] >= 0]

    DR_5mers_expand_imperfect = dict()
    imp_len = 0
    DR_5mers_expand_imperfect[imp_len] = DR_5mers_expand_imperfect_out.copy()

    DR_5mers_expand_imperfect[imp_len]['imp_seq_L'] = [reference_genome_masked_STR[chrom][end+1:end+2] for chrom, end in zip(DR_5mers_expand_imperfect[imp_len]['chrom'], DR_5mers_expand_imperfect[imp_len]['L_end'])]
    DR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [reference_genome_masked_STR[chrom][end+1:end+2] for chrom, end in zip(DR_5mers_expand_imperfect[imp_len]['chrom'], DR_5mers_expand_imperfect[imp_len]['R_end'])]

    DR_5mers_expand_imperfect[imp_len+1] = DR_5mers_expand_imperfect[imp_len].loc[(DR_5mers_expand_imperfect[imp_len]['imp_seq_L'] == DR_5mers_expand_imperfect[imp_len]['imp_seq_R']) & (DR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') == 0) & (DR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') == 0) & (DR_5mers_expand_imperfect[imp_len]['spacer'] > 1)]
    DR_5mers_expand_imperfect[imp_len] = DR_5mers_expand_imperfect[imp_len].loc[(DR_5mers_expand_imperfect[imp_len]['imp_seq_L'] != DR_5mers_expand_imperfect[imp_len]['imp_seq_R']) | (DR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') != 0) | (DR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') != 0) | (DR_5mers_expand_imperfect[imp_len]['spacer'] <= 1)]

    imp_len +=1
    DR_5mers_expand_imperfect[imp_len]['L_end'] = DR_5mers_expand_imperfect[imp_len]['L_end'] + 2
    DR_5mers_expand_imperfect[imp_len]['R_end'] = DR_5mers_expand_imperfect[imp_len]['R_end'] + 2
    DR_5mers_expand_imperfect[imp_len]['spacer'] = DR_5mers_expand_imperfect[imp_len]['spacer'] - 2


    list_len = len(DR_5mers_expand_imperfect[imp_len])
    while list_len > 0:
        DR_5mers_expand_imperfect[imp_len]['imp_seq_L'] = [reference_genome_masked_STR[chrom][end:end+1] for chrom, end in zip(DR_5mers_expand_imperfect[imp_len]['chrom'], DR_5mers_expand_imperfect[imp_len]['L_end'])]
        DR_5mers_expand_imperfect[imp_len]['imp_seq_R'] = [reference_genome_masked_STR[chrom][end:end+1] for chrom, end in zip(DR_5mers_expand_imperfect[imp_len]['chrom'], DR_5mers_expand_imperfect[imp_len]['R_end'])]

        DR_5mers_expand_imperfect[imp_len+1] = DR_5mers_expand_imperfect[imp_len].loc[(DR_5mers_expand_imperfect[imp_len]['imp_seq_L'] == DR_5mers_expand_imperfect[imp_len]['imp_seq_R']) & (DR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') == 0) & (DR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') == 0) & (DR_5mers_expand_imperfect[imp_len]['spacer'] > 0)]
        DR_5mers_expand_imperfect[imp_len] = DR_5mers_expand_imperfect[imp_len].loc[(DR_5mers_expand_imperfect[imp_len]['imp_seq_L'] != DR_5mers_expand_imperfect[imp_len]['imp_seq_R']) | (DR_5mers_expand_imperfect[imp_len]['imp_seq_L'].str.count('N') != 0) | (DR_5mers_expand_imperfect[imp_len]['imp_seq_R'].str.count('N') != 0) | (DR_5mers_expand_imperfect[imp_len]['spacer'] <= 0)]

        imp_len +=1
        list_len = len(DR_5mers_expand_imperfect[imp_len])
        DR_5mers_expand_imperfect[imp_len]['L_end'] = DR_5mers_expand_imperfect[imp_len]['L_end'] + 1
        DR_5mers_expand_imperfect[imp_len]['R_end'] = DR_5mers_expand_imperfect[imp_len]['R_end'] + 1
        DR_5mers_expand_imperfect[imp_len]['spacer'] = DR_5mers_expand_imperfect[imp_len]['spacer'] - 1

        print('\r' + 'chrom ' + str(chrom) + ' expanding right imp_len ' +str(imp_len), end='        ')

    DR_5mers_expand_imperfect = pd.concat(DR_5mers_expand_imperfect).reset_index(drop = True)
    DR_5mers_expand_imperfect = DR_5mers_expand_imperfect.drop_duplicates(subset = ['L_start', 'R_start', 'L_end', 'R_end'])
    DR_5mers_expand_imperfect['stem_len'] = DR_5mers_expand_imperfect['L_end'] - DR_5mers_expand_imperfect['L_start']
    DR_expand_imperfect = DR_5mers_expand_imperfect.loc[DR_5mers_expand_imperfect['stem_len'] >9].copy()
    DR_expand_imperfect['seq_L'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(DR_expand_imperfect['chrom'], DR_expand_imperfect['L_start'], DR_expand_imperfect['L_end'])]
    DR_expand_imperfect['seq_R'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(DR_expand_imperfect['chrom'], DR_expand_imperfect['R_start'], DR_expand_imperfect['R_end'])]
    DR_expand_imperfect['Sequence'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(DR_expand_imperfect['chrom'], DR_expand_imperfect['L_start'], DR_expand_imperfect['R_end'])]
    DR_expand_imperfect['freq'] = DR_expand_imperfect.groupby('seq_L')['seq_L'].transform('count')
    DR_expand_imperfect['spacer'] = DR_expand_imperfect['R_start'] - (DR_expand_imperfect['L_end'])
    DR_expand_imperfect = DR_expand_imperfect.loc[DR_expand_imperfect['Sequence'].str.count('N') == 0]
    DR_expand_imperfect = DR_expand_imperfect.sort_values(by = ['chrom', 'L_start', 'R_end']).reset_index(drop = True)
    del DR_expand_imperfect['imp_seq_L']; del DR_expand_imperfect['imp_seq_R']
    DR_expand_imperfect['#MM'] = [sum(1 for a, b in zip(seq1, seq2) if a != b) for seq1, seq2 in zip(DR_expand_imperfect['seq_L'], DR_expand_imperfect['seq_R'])]
    
    print('\r' + 'chrom ' + str(chrom) + ' done')
          
    return DR_expand_imperfect

DR_expand_imperfect_all = dict()
for chrom in range(chr_range,23):
    DR_expand_imperfect_all[chrom] = DR_expand_imperfect_L_R(chrom)

DR_expand_imperfect_all = pd.concat(DR_expand_imperfect_all)

DR_expand_imperfect_all = DR_expand_imperfect_all.sort_values(by = ['chrom', 'L_start', 'R_end']).reset_index(drop = True)

#### Save DR data <a name="DB_DR_save"></a>

[Return to Table of Contents](#TOC)
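
The `#MM` column in the table saved here counts the positions at which the left and right arms of a direct repeat differ. A minimal sketch of that count on toy strings follows; the helper name `count_mismatches` is hypothetical and is not part of the notebook.

```python
def count_mismatches(seq_l, seq_r):
    """Count positions where two equal-length arm sequences differ."""
    # zip() stops at the shorter string, so equal lengths are assumed here
    return sum(1 for a, b in zip(seq_l, seq_r) if a != b)


# Toy arms that differ only at the fifth base (A vs T)
print(count_mismatches("ACGTACGTAC", "ACGTTCGTAC"))  # prints 1
```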

In [None]:
# Save DR database
DR_expand_imperfect_all.to_csv('./custom_db/direct_repeats_withoutSTRs_imperfect_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', index = False)

## Z-DNA <a name="DB_ZDNA"></a>

### ZDNA version 1: <a name="DB_ZDNA_v1"></a>
- alternating purine–pyrimidine tracts of at least 10 nucleotides, excluding AT/TA dinucleotides

[Return to Table of Contents](#TOC)
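
As a minimal sketch of the definition above, the snippet below applies it to a toy string rather than to the genome. It is a single-pass scan written for illustration, not the seed-and-extend procedure used in the cells that follow; the names `base_class` and `find_ry_tracts` and the toy sequence are assumptions. Intervals are reported as half-open `(start, end)` pairs.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def base_class(base):
    """Map a base to 'R' (purine), 'Y' (pyrimidine), or None for anything else."""
    if base in PURINES:
        return "R"
    if base in PYRIMIDINES:
        return "Y"
    return None


def find_ry_tracts(seq, min_len=10):
    """Return half-open (start, end) intervals of alternating R/Y tracts with no AT/TA step."""
    tracts = []
    start = 0
    for i in range(1, len(seq) + 1):
        # The run continues only if the next base alternates class
        # and the step itself is not an AT or TA dinucleotide.
        extends = (
            i < len(seq)
            and base_class(seq[i]) is not None
            and base_class(seq[i - 1]) is not None
            and base_class(seq[i]) != base_class(seq[i - 1])
            and seq[i - 1:i + 1] not in ("AT", "TA")
        )
        if not extends:
            if i - start >= min_len and base_class(seq[start]) is not None:
                tracts.append((start, i))
            start = i
    return tracts


toy = "AAGCGCGCGCGCATGTGTGTGTGTCC"
for start, end in find_ry_tracts(toy):
    print(start, end, toy[start:end])
```

In the toy string the alternating run is split between the `A` and the `T`, so two tracts are reported instead of one longer tract.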

In [None]:
# Replace reference genome with purine/pyrimidine sequence
RY_genome = dict()
for chrom in range(chr_range,23):
    RY_genome[chrom] = reference_genome_masked_STR[chrom].replace('A', 'R').replace('G', 'R').replace('C', 'Y').replace('T', 'Y')

In [None]:
# Find all alternating purine/pyrimidine sequences of 8nt length
found_Zmers = dict()
for chrom in range(chr_range,23):
    found_Zmers[chrom] = pd.DataFrame()
    current_Z_8mers = [match.start(0) for match in re.finditer('RYRYRYRY', RY_genome[chrom], overlapped = True)]
    found_Zmers[chrom]['start'] = current_Z_8mers
    found_Zmers[chrom]['end'] = found_Zmers[chrom]['start'] + 8
    found_Zmers[chrom]['chrom'] = chrom
    print('\r' + ' chr' + str(chrom), end='        ')

found_Zmers = pd.concat(found_Zmers)

found_Zmers['Sequence'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(found_Zmers['chrom'], found_Zmers['start'], found_Zmers['end'])]
found_Zmers['RY_seq'] = [RY_genome[chrom][start:end] for chrom, start, end in zip(found_Zmers['chrom'], found_Zmers['start'], found_Zmers['end'])]

In [None]:
# Exclude sequences with AT dinucleotides
found_Zmers_noAT = found_Zmers.loc[(found_Zmers['Sequence'].str.count('AT') == 0) & (found_Zmers['Sequence'].str.count('TA') == 0)]

In [None]:
# Expand Z-DNA left
Zmers_expand_noAT = dict()
expand_left = found_Zmers_noAT.copy()
list_len = len(expand_left)
pos = 0
while len(expand_left) > 0:
    expand_left['Y_base'] = [RY_genome[chrom][start-1:start] for chrom, start in zip(expand_left['chrom'], expand_left['start'])]
    expand_left['R_base'] = [RY_genome[chrom][start-2:start-1] for chrom, start in zip(expand_left['chrom'], expand_left['start'])]
    expand_left['seq-1'] = [reference_genome_masked_STR[chrom][start-1:end] for chrom, start, end in zip(expand_left['chrom'], expand_left['start'], expand_left['end'])]
    expand_left['seq-2'] = [reference_genome_masked_STR[chrom][start-2:end] for chrom, start, end in zip(expand_left['chrom'], expand_left['start'], expand_left['end'])]

    Zmers_expand_noAT[pos] = expand_left.loc[(expand_left['Y_base'] != 'Y') | (expand_left['seq-1'].str.count('AT') != 0) | (expand_left['seq-1'].str.count('TA') != 0)].copy()
    expand_left = expand_left.loc[(expand_left['Y_base'] == 'Y') & (expand_left['seq-1'].str.count('AT') == 0) & (expand_left['seq-1'].str.count('TA') == 0)]
    Zmers_expand_noAT[pos-1] = expand_left.loc[(expand_left['R_base'] != 'R') | (expand_left['seq-2'].str.count('AT') != 0) | (expand_left['seq-2'].str.count('TA') != 0)].copy()
    Zmers_expand_noAT[pos-1]['start'] = Zmers_expand_noAT[pos-1]['start'] -1
    expand_left = expand_left.loc[(expand_left['Y_base'] == 'Y') & (expand_left['R_base'] == 'R') & (expand_left['seq-2'].str.count('AT') == 0) & (expand_left['seq-2'].str.count('TA') == 0)]
    expand_left['start'] = expand_left['start'] -2
    pos += -2
    print('\r' + 'expanding left ' + str(pos) + ' #' + str(len(expand_left)), end = '         ')

In [None]:
# Expand Z-DNA right    
expand_right = pd.concat(Zmers_expand_noAT)
Zmers_expand_noAT = dict()
list_len = len(expand_right)
pos = 0
while len(expand_right) > 0:
    expand_right['R_base'] = [RY_genome[chrom][end:end+1] for chrom, end in zip(expand_right['chrom'], expand_right['end'])]
    expand_right['Y_base'] = [RY_genome[chrom][end+1:end+2] for chrom, end in zip(expand_right['chrom'], expand_right['end'])]
    expand_right['seq+1'] = [reference_genome_masked_STR[chrom][start:end+1] for chrom, start, end in zip(expand_right['chrom'], expand_right['start'], expand_right['end'])]
    expand_right['seq+2'] = [reference_genome_masked_STR[chrom][start:end+2] for chrom, start, end in zip(expand_right['chrom'], expand_right['start'], expand_right['end'])]

    Zmers_expand_noAT[pos] = expand_right.loc[(expand_right['R_base'] != 'R')  | (expand_right['seq+1'].str.count('AT') != 0) | (expand_right['seq+1'].str.count('TA') != 0)].copy()
    expand_right = expand_right.loc[(expand_right['R_base'] == 'R') & (expand_right['seq+1'].str.count('AT') == 0) & (expand_right['seq+1'].str.count('TA') == 0)]
    Zmers_expand_noAT[pos-1] = expand_right.loc[(expand_right['Y_base'] != 'Y') | (expand_right['seq+2'].str.count('AT') != 0) | (expand_right['seq+2'].str.count('TA') != 0)].copy()
    Zmers_expand_noAT[pos-1]['end'] = Zmers_expand_noAT[pos-1]['end'] +1
    expand_right = expand_right.loc[(expand_right['R_base'] == 'R') & (expand_right['Y_base'] == 'Y') & (expand_right['seq+2'].str.count('AT') == 0) & (expand_right['seq+2'].str.count('TA') == 0)]
    expand_right['end'] = expand_right['end'] +2
    pos += 2
    print('\r' + 'expanding right ' + str(pos) + ' #' + str(len(expand_right)), end = '         ')
Zmers_expand_noAT = pd.concat(Zmers_expand_noAT)
del Zmers_expand_noAT['Y_base']; del Zmers_expand_noAT['R_base']; del Zmers_expand_noAT['seq-1']; del Zmers_expand_noAT['seq-2']; del Zmers_expand_noAT['seq+1']; del Zmers_expand_noAT['seq+2']

In [None]:
# Drop duplicates and annotate
Zmers_expand_noAT = Zmers_expand_noAT.drop_duplicates(subset = ['chrom', 'start', 'end'], keep = 'first')
Zmers_expand_noAT['Sequence'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(Zmers_expand_noAT['chrom'], Zmers_expand_noAT['start'], Zmers_expand_noAT['end'])]
Zmers_expand_noAT['RY_seq'] = [RY_genome[chrom][start:end] for chrom, start, end in zip(Zmers_expand_noAT['chrom'], Zmers_expand_noAT['start'], Zmers_expand_noAT['end'])]
Zmers_expand_noAT = Zmers_expand_noAT.sort_values(by = ['chrom', 'start', 'end']).reset_index(drop = True)
Zmers_expand_noAT['length'] = Zmers_expand_noAT['Sequence'].str.len()

In [None]:
# Minimum length for Z-DNA
Zmers_expand_noAT = Zmers_expand_noAT.loc[Zmers_expand_noAT['length'] >= 10]

In [None]:
# Save Z-DNA database
Zmers_expand_noAT.to_csv('./custom_db/temp/ZDNA_noSTRs_noAT_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', index = False)

### ZDNA version 2: <a name="DB_ZDNA_v2"></a>
- G followed by Y (C or T) for at least 10 nt

[Return to Table of Contents](#TOC)
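
The G-strand and C-strand searches below look for the same kind of motif on opposite strands: G alternating with a pyrimidine, or C alternating with a purine. A minimal sketch of that distinction on toy strings follows; the helpers `alternates` and `gy_strands` are hypothetical and only illustrate the pattern, not the notebook's seed-and-extend search.

```python
def alternates(seq, set_a, set_b):
    """True if seq alternates between set_a and set_b bases, starting with either set."""
    if not seq:
        return False
    for first, second in ((set_a, set_b), (set_b, set_a)):
        phases = (first, second)
        if all(base in phases[i % 2] for i, base in enumerate(seq)):
            return True
    return False


def gy_strands(seq):
    """Return the strand labels ('G', 'C') whose alternation pattern seq satisfies."""
    labels = []
    if alternates(seq, {"G"}, {"C", "T"}):
        labels.append("G")
    if alternates(seq, {"C"}, {"A", "G"}):
        labels.append("C")
    return labels


for toy in ("GTGTGTGTGT", "CACACACACA", "GCGCGCGCGC", "GAGAGAGAGA"):
    print(toy, gy_strands(toy))
```

A pure GC repeat satisfies both patterns, which is why the combined table needs a duplicate check after the two strands are concatenated.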

#### G strand

In [None]:
# Replace pyrimidines in reference genome with Y
AGY_genome = dict()
for chrom in range(chr_range,23):
    AGY_genome[chrom] = reference_genome_masked_STR[chrom].replace('C', 'Y').replace('T', 'Y')

# Find all alternating G/pyrimidine sequences of 8nt length
found_Zmers = dict()
for chrom in range(chr_range,23):
    found_Zmers[chrom] = pd.DataFrame()
    current_Z_8mers = [match.start(0) for match in re.finditer('GYGYGYGY', AGY_genome[chrom], overlapped = True)]
    found_Zmers[chrom]['start'] = current_Z_8mers
    found_Zmers[chrom]['end'] = found_Zmers[chrom]['start'] + 8
    found_Zmers[chrom]['chrom'] = chrom
    print('\r' + ' chr' + str(chrom), end='        ')
found_Zmers = pd.concat(found_Zmers)

found_Zmers['Sequence'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(found_Zmers['chrom'], found_Zmers['start'], found_Zmers['end'])]
found_Zmers['AGY_seq'] = [AGY_genome[chrom][start:end] for chrom, start, end in zip(found_Zmers['chrom'], found_Zmers['start'], found_Zmers['end'])]

In [None]:
# Expand Z-DNA left
Zmers_expand_Gstrand = dict()
expand_left = found_Zmers.copy()
list_len = len(expand_left)
pos = 0
while len(expand_left) > 0:
    expand_left['Y_base'] = [AGY_genome[chrom][start-1:start] for chrom, start in zip(expand_left['chrom'], expand_left['start'])]
    expand_left['G_base'] = [AGY_genome[chrom][start-2:start-1] for chrom, start in zip(expand_left['chrom'], expand_left['start'])]
    Zmers_expand_Gstrand[pos] = expand_left.loc[expand_left['Y_base'] != 'Y'].copy()
    Zmers_expand_Gstrand[pos-1] = expand_left.loc[(expand_left['Y_base'] == 'Y') & (expand_left['G_base'] != 'G')].copy()
    Zmers_expand_Gstrand[pos-1]['start'] = Zmers_expand_Gstrand[pos-1]['start'] -1
    expand_left = expand_left.loc[(expand_left['Y_base'] == 'Y') & (expand_left['G_base'] == 'G')]
    expand_left['start'] = expand_left['start'] -2
    pos += -2
    print('\r' + 'expanding left ' + str(pos) + ' #' + str(len(expand_left)), end = '         ')

In [None]:
# Expand Z-DNA right
expand_right = pd.concat(Zmers_expand_Gstrand)
Zmers_expand_Gstrand = dict()
list_len = len(expand_right)
pos = 0
while len(expand_right) > 0:
    expand_right['G_base'] = [AGY_genome[chrom][end:end+1] for chrom, end in zip(expand_right['chrom'], expand_right['end'])]
    expand_right['Y_base'] = [AGY_genome[chrom][end+1:end+2] for chrom, end in zip(expand_right['chrom'], expand_right['end'])]
    Zmers_expand_Gstrand[pos] = expand_right.loc[expand_right['G_base'] != 'G'].copy()
    Zmers_expand_Gstrand[pos-1] = expand_right.loc[(expand_right['G_base'] == 'G') & (expand_right['Y_base'] != 'Y')].copy()
    Zmers_expand_Gstrand[pos-1]['end'] = Zmers_expand_Gstrand[pos-1]['end'] +1
    expand_right = expand_right.loc[(expand_right['G_base'] == 'G') & (expand_right['Y_base'] == 'Y')]
    expand_right['end'] = expand_right['end'] +2
    pos += 2
    print('\r' + 'expanding right ' + str(pos) + ' #' + str(len(expand_right)), end = '         ')
Zmers_expand_Gstrand = pd.concat(Zmers_expand_Gstrand)
del Zmers_expand_Gstrand['Y_base']; del Zmers_expand_Gstrand['G_base']

In [None]:
# Drop duplicates and annotate
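# Deduplicate on coordinates: 'Sequence' still holds the seed 8-mer here, so full-row comparison can miss duplicates
Zmers_expand_Gstrand = Zmers_expand_Gstrand.drop_duplicates(subset = ['chrom', 'start', 'end'], keep = 'first')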
Zmers_expand_Gstrand['Sequence'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(Zmers_expand_Gstrand['chrom'], Zmers_expand_Gstrand['start'], Zmers_expand_Gstrand['end'])]
Zmers_expand_Gstrand['AGY_seq'] = [AGY_genome[chrom][start:end] for chrom, start, end in zip(Zmers_expand_Gstrand['chrom'], Zmers_expand_Gstrand['start'], Zmers_expand_Gstrand['end'])]
Zmers_expand_Gstrand = Zmers_expand_Gstrand.sort_values(by = ['chrom', 'start', 'end']).reset_index(drop = True)
Zmers_expand_Gstrand['length'] = Zmers_expand_Gstrand['Sequence'].str.len()

In [None]:
# Minimum length for Z-DNA
Zmers_expand_Gstrand = Zmers_expand_Gstrand.loc[Zmers_expand_Gstrand['length'] >= 10]

#### C-strand

In [None]:
# Replace purines in reference genome with R
TCR_genome = dict()
for chrom in range(chr_range,23):
    TCR_genome[chrom] = reference_genome_masked_STR[chrom].replace('A', 'R').replace('G', 'R')

# Find all alternating C/purine sequences of 8nt length
found_Zmers = dict()
for chrom in range(chr_range,23):
    found_Zmers[chrom] = pd.DataFrame()
    current_Z_8mers = [match.start(0) for match in re.finditer('CRCRCRCR', TCR_genome[chrom], overlapped = True)]
    found_Zmers[chrom]['start'] = current_Z_8mers
    found_Zmers[chrom]['end'] = found_Zmers[chrom]['start'] + 8
    found_Zmers[chrom]['chrom'] = chrom
    print('\r' + ' chr' + str(chrom), end='        ')
found_Zmers = pd.concat(found_Zmers)
found_Zmers['Sequence'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(found_Zmers['chrom'], found_Zmers['start'], found_Zmers['end'])]
found_Zmers['TCR_seq'] = [TCR_genome[chrom][start:end] for chrom, start, end in zip(found_Zmers['chrom'], found_Zmers['start'], found_Zmers['end'])]

In [None]:
# Expand Z-DNA left
Zmers_expand_Cstrand = dict()
expand_left = found_Zmers.copy()
list_len = len(expand_left)
pos = 0
while len(expand_left) > 0:
    expand_left['R_base'] = [TCR_genome[chrom][start-1:start] for chrom, start in zip(expand_left['chrom'], expand_left['start'])]
    expand_left['C_base'] = [TCR_genome[chrom][start-2:start-1] for chrom, start in zip(expand_left['chrom'], expand_left['start'])]
    Zmers_expand_Cstrand[pos] = expand_left.loc[expand_left['R_base'] != 'R'].copy()
    Zmers_expand_Cstrand[pos-1] = expand_left.loc[(expand_left['R_base'] == 'R') & (expand_left['C_base'] != 'C')].copy()
    Zmers_expand_Cstrand[pos-1]['start'] = Zmers_expand_Cstrand[pos-1]['start'] -1
    expand_left = expand_left.loc[(expand_left['R_base'] == 'R') & (expand_left['C_base'] == 'C')]
    expand_left['start'] = expand_left['start'] -2
    pos += -2
    print('\r' + 'expanding left ' + str(pos) + ' #' + str(len(expand_left)), end = '         ')

In [None]:
# Expand Z-DNA right
expand_right = pd.concat(Zmers_expand_Cstrand)
Zmers_expand_Cstrand = dict()
list_len = len(expand_right)
pos = 0
while len(expand_right) > 0:
    expand_right['C_base'] = [TCR_genome[chrom][end:end+1] for chrom, end in zip(expand_right['chrom'], expand_right['end'])]
    expand_right['R_base'] = [TCR_genome[chrom][end+1:end+2] for chrom, end in zip(expand_right['chrom'], expand_right['end'])]
    Zmers_expand_Cstrand[pos] = expand_right.loc[expand_right['C_base'] != 'C'].copy()
    Zmers_expand_Cstrand[pos-1] = expand_right.loc[(expand_right['C_base'] == 'C') & (expand_right['R_base'] != 'R')].copy()
    Zmers_expand_Cstrand[pos-1]['end'] = Zmers_expand_Cstrand[pos-1]['end'] +1
    expand_right = expand_right.loc[(expand_right['C_base'] == 'C') & (expand_right['R_base'] == 'R')]
    expand_right['end'] = expand_right['end'] +2
    pos += 2
    print('\r' + 'expanding right ' + str(pos) + ' #' + str(len(expand_right)), end = '         ')
Zmers_expand_Cstrand = pd.concat(Zmers_expand_Cstrand)
del Zmers_expand_Cstrand['R_base']; del Zmers_expand_Cstrand['C_base']

In [None]:
# Drop duplicates and annotate
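# Deduplicate on coordinates: 'Sequence' still holds the seed 8-mer here, so full-row comparison can miss duplicates
Zmers_expand_Cstrand = Zmers_expand_Cstrand.drop_duplicates(subset = ['chrom', 'start', 'end'], keep = 'first')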
Zmers_expand_Cstrand['Sequence'] = [reference_genome_masked_STR[chrom][start:end] for chrom, start, end in zip(Zmers_expand_Cstrand['chrom'], Zmers_expand_Cstrand['start'], Zmers_expand_Cstrand['end'])]
Zmers_expand_Cstrand['TCR_seq'] = [TCR_genome[chrom][start:end] for chrom, start, end in zip(Zmers_expand_Cstrand['chrom'], Zmers_expand_Cstrand['start'], Zmers_expand_Cstrand['end'])]
Zmers_expand_Cstrand = Zmers_expand_Cstrand.sort_values(by = ['chrom', 'start', 'end']).reset_index(drop = True)
Zmers_expand_Cstrand['length'] = Zmers_expand_Cstrand['Sequence'].str.len()

In [None]:
# Minimum length for Z-DNA
Zmers_expand_Cstrand = Zmers_expand_Cstrand.loc[Zmers_expand_Cstrand['length'] >= 10]

#### Combine G and C strands

In [None]:
Zmers_expand_Gstrand['Strand'] = 'G'
Zmers_expand_Cstrand['Strand'] = 'C'
Zmers_expand = pd.concat([Zmers_expand_Gstrand, Zmers_expand_Cstrand])
del Zmers_expand['AGY_seq']; del Zmers_expand['TCR_seq']
Zmers_expand = Zmers_expand.drop_duplicates(subset = ['chrom', 'start', 'end']).sort_values(by = ['chrom', 'start', 'end'])

In [None]:
Zmers_expand.to_csv('./custom_db/temp/ZDNA_noSTRs_GY_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', index = False)

### Combine RY and GY versions <a name="DB_ZDNA_all"></a>
- all motifs in GY version are contained within RY version
- GY motifs can extend into longer RY motifs

[Return to Table of Contents](#TOC)
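
The next cell separates RY motifs into three cases: no overlap with any GY motif, identical to a GY motif, or containing a shorter GY motif. The sketch below restates those cases in plain Python on toy intervals. It assumes half-open `(chrom, start, end)` tuples, and the helpers `overlaps` and `classify_ry` are hypothetical; the notebook itself uses `measure_distance` together with pandas duplicate handling.

```python
def overlaps(a, b):
    """True if two half-open (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def classify_ry(ry, gy_motifs):
    """Label one RY interval relative to a collection of GY intervals."""
    if ry in gy_motifs:
        return "identical"
    if any(overlaps(ry, gy) for gy in gy_motifs):
        return "partial"
    return "none"


# Toy coordinates, not taken from the genome
gy = {(1, 100, 112), (1, 205, 215)}
ry = [(1, 100, 112), (1, 200, 220), (1, 300, 312), (2, 100, 112)]
for interval in ry:
    print(interval, classify_ry(interval, gy))
```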


In [None]:
Zmers_expand_noAT = pd.read_csv('./custom_db/temp/ZDNA_noSTRs_noAT_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', usecols = ['start', 'end', 'chrom', 'Sequence', 'length'])
Zmers_expand_noAT['repeat'] = 'RY'
Zmers_expand = pd.read_csv('./custom_db/temp/ZDNA_noSTRs_GY_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', usecols = ['start', 'end', 'chrom', 'Sequence', 'length', 'Strand'])
Zmers_expand['repeat'] = 'GY'

Zmers_expand_noAT = measure_distance(Zmers_expand_noAT, Zmers_expand, 'GY')

# RY does not overlap GY
Zmers_expand_noAT_noGYoverlap = Zmers_expand_noAT.loc[Zmers_expand_noAT['GY_distance_min'] >0]

Zmers_expand_noAT_GYoverlap = Zmers_expand_noAT.loc[Zmers_expand_noAT['GY_distance_min'] <1]
# GY and RY are identical
Zmers_expand_fulloverlap = pd.concat([Zmers_expand, Zmers_expand_noAT_GYoverlap]).loc[pd.concat([Zmers_expand, Zmers_expand_noAT_GYoverlap]).duplicated(subset = ['chrom', 'start', 'end'], keep = 'last')]
# GY is contained within RY
Zmers_expand_partialoverlap = pd.concat([Zmers_expand, Zmers_expand, Zmers_expand_noAT_GYoverlap]).drop_duplicates(subset = ['chrom', 'start', 'end'], keep = False)

# Check whether partial overlap is with G or C strand
Zmers_expand_partialoverlap = measure_distance(Zmers_expand_partialoverlap, Zmers_expand.loc[Zmers_expand['Strand'] == 'G'], 'Gstrand')
Zmers_expand_partialoverlap = measure_distance(Zmers_expand_partialoverlap, Zmers_expand.loc[Zmers_expand['Strand'] == 'C'], 'Cstrand')
Zmers_expand_partialoverlap_Gstrand = Zmers_expand_partialoverlap.loc[(Zmers_expand_partialoverlap['Gstrand_distance_min'] <1) & (Zmers_expand_partialoverlap['Cstrand_distance_min'] >0)].copy()
Zmers_expand_partialoverlap_Gstrand['Strand'] = 'G'
Zmers_expand_partialoverlap_Cstrand = Zmers_expand_partialoverlap.loc[(Zmers_expand_partialoverlap['Gstrand_distance_min'] >0) & (Zmers_expand_partialoverlap['Cstrand_distance_min'] <1)].copy()
Zmers_expand_partialoverlap_Cstrand['Strand'] = 'C'
Zmers_expand_partialoverlap_bothstrands = Zmers_expand_partialoverlap.loc[(Zmers_expand_partialoverlap['Gstrand_distance_min'] <1) & (Zmers_expand_partialoverlap['Cstrand_distance_min'] <1)].copy()

# Combine all motifs
Zmers_all = pd.concat([Zmers_expand_noAT_noGYoverlap, Zmers_expand_fulloverlap, Zmers_expand_partialoverlap_Gstrand, Zmers_expand_partialoverlap_Cstrand, Zmers_expand_partialoverlap_bothstrands])
Zmers_all = Zmers_all[['start', 'end', 'chrom', 'Sequence', 'length', 'Strand']].copy().sort_values(by = ['chrom', 'start']).reset_index(drop = True)

# Save
Zmers_all.to_csv('./custom_db/ZDNA_noSTRs_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', index = False)

## G4 motifs <a name="DB_G4"></a>
#### Uses G4-seq data, then cleans up based on motifs
- download files from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3003539
    - GSM3003539_Homo_all_w15_th-1_plus.hits.max.K.w50.25.bed.gz
    - GSM3003539_Homo_all_w15_th-1_minus.hits.max.K.w50.25.bed.gz
- download files from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3003540
    - GSM3003540_Homo_all_w15_th-1_plus.hits.max.PDS.w50.35.bed.gz
    - GSM3003540_Homo_all_w15_th-1_minus.hits.max.PDS.w50.35.bed.gz
- place files in directory './custom_db/G4seq/'

[Return to Table of Contents](#TOC)
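
Before running the cells below, it may help to confirm that all four files are in place. A minimal sketch using only the standard library follows; the helper name `missing_g4seq_files` is hypothetical, and the file names are copied from the list above.

```python
from pathlib import Path

G4SEQ_FILES = [
    "GSM3003539_Homo_all_w15_th-1_plus.hits.max.K.w50.25.bed.gz",
    "GSM3003539_Homo_all_w15_th-1_minus.hits.max.K.w50.25.bed.gz",
    "GSM3003540_Homo_all_w15_th-1_plus.hits.max.PDS.w50.35.bed.gz",
    "GSM3003540_Homo_all_w15_th-1_minus.hits.max.PDS.w50.35.bed.gz",
]


def missing_g4seq_files(directory="./custom_db/G4seq/"):
    """Return the expected G4-seq file names that are not present in directory."""
    base = Path(directory)
    return [name for name in G4SEQ_FILES if not (base / name).is_file()]


missing = missing_g4seq_files()
if missing:
    print("Missing G4-seq files:", missing)
```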

#### Liftover datasets from hg19 <a name="DB_G4_liftover"></a>
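
The liftover below converts each interval's start and end separately and keeps the interval only if both endpoints land on the same autosome with start before end. A minimal sketch of that filter on mocked results follows. It assumes each conversion result is a list of `(chrom, pos, strand, score)` tuples, an empty list when the position does not map, or `None` when the chromosome is unknown; the helper `lift_interval` and the toy coordinates are hypothetical.

```python
AUTOSOMES = {"chr" + str(n) for n in range(1, 23)}


def lift_interval(start_hits, end_hits):
    """Return (chrom, start, end) in the target assembly, or None if the interval is dropped."""
    # An empty list or None means the endpoint could not be converted
    if not start_hits or not end_hits:
        return None
    chrom_s, start = start_hits[0][0], start_hits[0][1]
    chrom_e, end = end_hits[0][0], end_hits[0][1]
    # Drop intervals split across chromosomes, outside chr1-22, or inverted
    if chrom_s != chrom_e or chrom_s not in AUTOSOMES or start >= end:
        return None
    return (chrom_s, start, end)


# Mocked conversion results with toy coordinates
print(lift_interval([("chr1", 1000, "+", 1)], [("chr1", 1050, "+", 1)]))
print(lift_interval([("chr1", 1000, "+", 1)], []))
print(lift_interval([("chr1", 1000, "+", 1)], [("chr2", 1050, "+", 1)]))
```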

In [None]:
# Load G4-seq files
# Note: '-' strand corresponds to G strand
G4seq_PDS_R = pd.read_csv('./custom_db/G4seq/GSM3003540_Homo_all_w15_th-1_plus.hits.max.PDS.w50.35.bed.gz', header = None, sep = '\t', compression = 'gzip')
G4seq_PDS_R.columns = ['chrom', 'start', 'end', 'score']

G4seq_PDS_F = pd.read_csv('./custom_db/G4seq/GSM3003540_Homo_all_w15_th-1_minus.hits.max.PDS.w50.35.bed.gz', header = None, sep = '\t', compression = 'gzip')
G4seq_PDS_F.columns = ['chrom', 'start', 'end', 'score']

G4seq_K_R = pd.read_csv('./custom_db/G4seq/GSM3003539_Homo_all_w15_th-1_plus.hits.max.K.w50.25.bed.gz', header = None, sep = '\t', compression = 'gzip')
G4seq_K_R.columns = ['chrom', 'start', 'end', 'score']

G4seq_K_F = pd.read_csv('./custom_db/G4seq/GSM3003539_Homo_all_w15_th-1_minus.hits.max.K.w50.25.bed.gz', header = None, sep = '\t', compression = 'gzip')
G4seq_K_F.columns = ['chrom', 'start', 'end', 'score']

In [None]:
# Liftover to hg38
lo = LiftOver('hg19', 'hg38')
for repeats in [G4seq_PDS_F, G4seq_PDS_R, G4seq_K_F, G4seq_K_R]:
    # convert_coordinate may return an empty list (position not mapped) or None (unknown chromosome); both become NaN
    repeats['hg38_lo'] = [lo.convert_coordinate(chrom, pos) for chrom,pos in zip(repeats['chrom'], repeats['start'])]
    repeats['hg38_chr_s'] = [pos[0][0] if pos else np.nan for pos in repeats['hg38_lo']]
    repeats['hg38_start'] = [pos[0][1] if pos else np.nan for pos in repeats['hg38_lo']]
    repeats['hg38_lo'] = [lo.convert_coordinate(chrom, pos) for chrom,pos in zip(repeats['chrom'], repeats['end'])]
    repeats['hg38_chr_e'] = [pos[0][0] if pos else np.nan for pos in repeats['hg38_lo']]
    repeats['hg38_end'] = [pos[0][1] if pos else np.nan for pos in repeats['hg38_lo']]

# Keep only entries whose start and end lift to the same autosome (chr1-22) with start < end
G4seq_PDS_F = G4seq_PDS_F.loc[(G4seq_PDS_F['hg38_chr_s'] == G4seq_PDS_F['hg38_chr_e']) & (G4seq_PDS_F['hg38_chr_s'].isin(['chr'+str(chrom) for chrom in range(1,23)])) & (G4seq_PDS_F['hg38_start'] < G4seq_PDS_F['hg38_end'])]
G4seq_PDS_R = G4seq_PDS_R.loc[(G4seq_PDS_R['hg38_chr_s'] == G4seq_PDS_R['hg38_chr_e']) & (G4seq_PDS_R['hg38_chr_s'].isin(['chr'+str(chrom) for chrom in range(1,23)])) & (G4seq_PDS_R['hg38_start'] < G4seq_PDS_R['hg38_end'])]
G4seq_K_F = G4seq_K_F.loc[(G4seq_K_F['hg38_chr_s'] == G4seq_K_F['hg38_chr_e']) & (G4seq_K_F['hg38_chr_s'].isin(['chr'+str(chrom) for chrom in range(1,23)])) & (G4seq_K_F['hg38_start'] < G4seq_K_F['hg38_end'])]
G4seq_K_R = G4seq_K_R.loc[(G4seq_K_R['hg38_chr_s'] == G4seq_K_R['hg38_chr_e']) & (G4seq_K_R['hg38_chr_s'].isin(['chr'+str(chrom) for chrom in range(1,23)])) & (G4seq_K_R['hg38_start'] < G4seq_K_R['hg38_end'])]

In [None]:
# Clean up
G4seq_PDS_F = G4seq_PDS_F[['hg38_chr_s', 'hg38_start', 'hg38_end', 'score']]
G4seq_PDS_F.columns = ['chrom', 'start', 'end', 'score']
G4seq_PDS_R = G4seq_PDS_R[['hg38_chr_s', 'hg38_start', 'hg38_end', 'score']]
G4seq_PDS_R.columns = ['chrom', 'start', 'end', 'score']
G4seq_K_F = G4seq_K_F[['hg38_chr_s', 'hg38_start', 'hg38_end', 'score']]
G4seq_K_F.columns = ['chrom', 'start', 'end', 'score']
G4seq_K_R = G4seq_K_R[['hg38_chr_s', 'hg38_start', 'hg38_end', 'score']]
G4seq_K_R.columns = ['chrom', 'start', 'end', 'score']

G4seq_PDS_F = G4seq_PDS_F.loc[G4seq_PDS_F['chrom'].isin(['chr'+str(chrom) for chrom in range(1,23)])]
G4seq_PDS_F['chrom'] = [chrom[3:] for chrom in G4seq_PDS_F['chrom']]
G4seq_PDS_F['chrom'] = G4seq_PDS_F['chrom'].astype(int)
G4seq_PDS_R = G4seq_PDS_R.loc[G4seq_PDS_R['chrom'].isin(['chr'+str(chrom) for chrom in range(1,23)])]
G4seq_PDS_R['chrom'] = [chrom[3:] for chrom in G4seq_PDS_R['chrom']]
G4seq_PDS_R['chrom'] = G4seq_PDS_R['chrom'].astype(int)
G4seq_K_F = G4seq_K_F.loc[G4seq_K_F['chrom'].isin(['chr'+str(chrom) for chrom in range(1,23)])]
G4seq_K_F['chrom'] = [chrom[3:] for chrom in G4seq_K_F['chrom']]
G4seq_K_F['chrom'] = G4seq_K_F['chrom'].astype(int)
G4seq_K_R = G4seq_K_R.loc[G4seq_K_R['chrom'].isin(['chr'+str(chrom) for chrom in range(1,23)])]
G4seq_K_R['chrom'] = [chrom[3:] for chrom in G4seq_K_R['chrom']]
G4seq_K_R['chrom'] = G4seq_K_R['chrom'].astype(int)

G4seq_PDS_F['start'] = G4seq_PDS_F['start'].astype(int)
G4seq_PDS_F['end'] = G4seq_PDS_F['end'].astype(int)
G4seq_PDS_R['start'] = G4seq_PDS_R['start'].astype(int)
G4seq_PDS_R['end'] = G4seq_PDS_R['end'].astype(int)
G4seq_K_F['start'] = G4seq_K_F['start'].astype(int)
G4seq_K_F['end'] = G4seq_K_F['end'].astype(int)
G4seq_K_R['start'] = G4seq_K_R['start'].astype(int)
G4seq_K_R['end'] = G4seq_K_R['end'].astype(int)

#### Reduce overlaps within each condition <a name="DB_G4_combine"></a>
[Return to Table of Contents](#TOC)
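
The cell below collapses overlapping G4-seq intervals within each dataset and assigns each merged interval the highest score among its members. A minimal plain-Python sketch of that idea follows. The helper `merge_intervals_max_score` is hypothetical, intervals are assumed to be half-open, and only intervals that share at least one base are merged; how the notebook's `interval_overlap_per_chrom` treats intervals that merely touch is not shown in this section.

```python
def merge_intervals_max_score(intervals):
    """Merge overlapping (chrom, start, end, score) tuples, keeping the maximum score."""
    merged = []
    # Sorting by chrom then start lets each interval be compared with the last merged one
    for chrom, start, end, score in sorted(intervals):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end), max(last[3], score))
        else:
            merged.append((chrom, start, end, score))
    return merged


# Toy intervals and scores, not taken from the G4-seq data
toy = [(1, 40, 80, 0.9), (1, 10, 50, 0.3), (1, 100, 120, 0.5), (2, 10, 20, 0.1)]
for row in merge_intervals_max_score(toy):
    print(row)
```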

In [None]:
# Within each category, combine overlapping entries while retaining G4seq scores
G4seq_PDS_F_overlap = interval_overlap_per_chrom(G4seq_PDS_F)
G4seq_PDS_F_combined = pd.concat([G4seq_PDS_F, G4seq_PDS_F_overlap])
G4seq_PDS_F_changed = G4seq_PDS_F_combined.drop_duplicates(subset = ['chrom', 'start', 'end'], keep = False).dropna(subset = ['score'])
G4seq_PDS_F_unchanged = G4seq_PDS_F_combined.sort_values(by = ['chrom', 'start', 'end', 'score']).reset_index(drop = True)
G4seq_PDS_F_unchanged = G4seq_PDS_F_unchanged.loc[G4seq_PDS_F_unchanged.duplicated(subset = ['chrom', 'start', 'end'], keep = 'last')]
G4seq_PDS_F_changed_overlap = interval_overlap_per_chrom(G4seq_PDS_F_changed)
G4seq_PDS_F_changed_overlap['score'] = [G4seq_PDS_F_changed.loc[(G4seq_PDS_F_changed['chrom'] == chrom) & (G4seq_PDS_F_changed['start'] >= start) & (G4seq_PDS_F_changed['end'] <= end)]['score'].max() for chrom, start, end in zip(G4seq_PDS_F_changed_overlap['chrom'], G4seq_PDS_F_changed_overlap['start'], G4seq_PDS_F_changed_overlap['end'])]
G4seq_PDS_F = pd.concat([G4seq_PDS_F_unchanged, G4seq_PDS_F_changed_overlap]).sort_values(by = ['chrom', 'start', 'end']).reset_index(drop = True)

G4seq_PDS_R_overlap = interval_overlap_per_chrom(G4seq_PDS_R)
G4seq_PDS_R_combined = pd.concat([G4seq_PDS_R, G4seq_PDS_R_overlap])
G4seq_PDS_R_changed = G4seq_PDS_R_combined.drop_duplicates(subset = ['chrom', 'start', 'end'], keep = False).dropna(subset = ['score'])
G4seq_PDS_R_unchanged = G4seq_PDS_R_combined.sort_values(by = ['chrom', 'start', 'end', 'score']).reset_index(drop = True)
G4seq_PDS_R_unchanged = G4seq_PDS_R_unchanged.loc[G4seq_PDS_R_unchanged.duplicated(subset = ['chrom', 'start', 'end'], keep = 'last')]
G4seq_PDS_R_changed_overlap = interval_overlap_per_chrom(G4seq_PDS_R_changed)
G4seq_PDS_R_changed_overlap['score'] = [G4seq_PDS_R_changed.loc[(G4seq_PDS_R_changed['chrom'] == chrom) & (G4seq_PDS_R_changed['start'] >= start) & (G4seq_PDS_R_changed['end'] <= end)]['score'].max() for chrom, start, end in zip(G4seq_PDS_R_changed_overlap['chrom'], G4seq_PDS_R_changed_overlap['start'], G4seq_PDS_R_changed_overlap['end'])]
G4seq_PDS_R = pd.concat([G4seq_PDS_R_unchanged, G4seq_PDS_R_changed_overlap]).sort_values(by = ['chrom', 'start', 'end']).reset_index(drop = True)

G4seq_K_F_overlap = interval_overlap_per_chrom(G4seq_K_F)
G4seq_K_F_combined = pd.concat([G4seq_K_F, G4seq_K_F_overlap])
G4seq_K_F_changed = G4seq_K_F_combined.drop_duplicates(subset = ['chrom', 'start', 'end'], keep = False).dropna(subset = ['score'])
G4seq_K_F_unchanged = G4seq_K_F_combined.sort_values(by = ['chrom', 'start', 'end', 'score']).reset_index(drop = True)
G4seq_K_F_unchanged = G4seq_K_F_unchanged.loc[G4seq_K_F_unchanged.duplicated(subset = ['chrom', 'start', 'end'], keep = 'last')]
G4seq_K_F_changed_overlap = interval_overlap_per_chrom(G4seq_K_F_changed)
G4seq_K_F_changed_overlap['score'] = [G4seq_K_F_changed.loc[(G4seq_K_F_changed['chrom'] == chrom) & (G4seq_K_F_changed['start'] >= start) & (G4seq_K_F_changed['end'] <= end)]['score'].max() for chrom, start, end in zip(G4seq_K_F_changed_overlap['chrom'], G4seq_K_F_changed_overlap['start'], G4seq_K_F_changed_overlap['end'])]
G4seq_K_F = pd.concat([G4seq_K_F_unchanged, G4seq_K_F_changed_overlap]).sort_values(by = ['chrom', 'start', 'end']).reset_index(drop = True)

G4seq_K_R_overlap = interval_overlap_per_chrom(G4seq_K_R)
G4seq_K_R_combined = pd.concat([G4seq_K_R, G4seq_K_R_overlap])
G4seq_K_R_changed = G4seq_K_R_combined.drop_duplicates(subset = ['chrom', 'start', 'end'], keep = False).dropna(subset = ['score'])
G4seq_K_R_unchanged = G4seq_K_R_combined.sort_values(by = ['chrom', 'start', 'end', 'score']).reset_index(drop = True)
G4seq_K_R_unchanged = G4seq_K_R_unchanged.loc[G4seq_K_R_unchanged.duplicated(subset = ['chrom', 'start', 'end'], keep = 'last')]
G4seq_K_R_changed_overlap = interval_overlap_per_chrom(G4seq_K_R_changed)
G4seq_K_R_changed_overlap['score'] = [G4seq_K_R_changed.loc[(G4seq_K_R_changed['chrom'] == chrom) & (G4seq_K_R_changed['start'] >= start) & (G4seq_K_R_changed['end'] <= end)]['score'].max() for chrom, start, end in zip(G4seq_K_R_changed_overlap['chrom'], G4seq_K_R_changed_overlap['start'], G4seq_K_R_changed_overlap['end'])]
G4seq_K_R = pd.concat([G4seq_K_R_unchanged, G4seq_K_R_changed_overlap]).sort_values(by = ['chrom', 'start', 'end']).reset_index(drop = True)
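
The four blocks above apply the same procedure to each G4-seq table: overlapping intervals are merged, and each merged interval receives the highest score among the entries it contains. A minimal sketch of that idea on a toy table follows. `merge_with_max_score` and the toy values are hypothetical and written only for illustration; this is not the `interval_overlap_per_chrom` function used above, and it assumes that intervals which overlap or touch (start <= previous end) belong to the same merged interval.

```python
import pandas as pd

def merge_with_max_score(df):
    """Merge overlapping intervals per chromosome, keeping the maximum score."""
    df = df.sort_values(['chrom', 'start', 'end']).reset_index(drop=True)
    merged = []
    for chrom, group in df.groupby('chrom', sort=True):
        current = None
        for start, end, score in zip(group['start'], group['end'], group['score']):
            if current is not None and start <= current['end']:
                # Overlaps the running interval: extend it and keep the best score
                current['end'] = max(current['end'], end)
                current['score'] = max(current['score'], score)
            else:
                if current is not None:
                    merged.append(current)
                current = {'chrom': chrom, 'start': start, 'end': end, 'score': score}
        if current is not None:
            merged.append(current)
    return pd.DataFrame(merged, columns=['chrom', 'start', 'end', 'score'])

# Toy input (made-up values): the first two intervals on chromosome 1 overlap
toy = pd.DataFrame({
    'chrom': [1, 1, 1, 2],
    'start': [100, 150, 400, 100],
    'end':   [200, 250, 450, 180],
    'score': [10.0, 30.0, 5.0, 7.0],
})
toy_merged = merge_with_max_score(toy)
print(toy_merged)
```

In this toy case the first two intervals collapse into a single interval spanning 100 to 250 with score 30.0, while the other two entries are returned unchanged.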

#### Check for overlaps between PDS and K+ G4-seq conditions <a name="DB_G4_overlap"></a>
[Return to Table of Contents](#TOC)

In [None]:
# Label each condition
G4seq_PDS_F['status'] = 'PDS'
G4seq_PDS_F['Strand'] = '+'
G4seq_PDS_R['status'] = 'PDS'
G4seq_PDS_R['Strand'] = '-'

G4seq_K_F['status'] = 'K+'
G4seq_K_F['Strand'] = '+'
G4seq_K_R['status'] = 'K+'
G4seq_K_R['Strand'] = '-'

In [None]:
# Check for overlaps
G4seq_K_F = measure_distance(G4seq_K_F, G4seq_PDS_F, 'PDS')
G4seq_PDS_F = measure_distance(G4seq_PDS_F, G4seq_K_F, 'K')

G4seq_K_R = measure_distance(G4seq_K_R, G4seq_PDS_R, 'PDS')
G4seq_PDS_R = measure_distance(G4seq_PDS_R, G4seq_K_R, 'K')

# Use K+ condition entries for any overlaps, since this is the more sensitive test
G4seq_both_F = G4seq_K_F.loc[G4seq_K_F['PDS_distance_min'] <1].copy()
G4seq_both_F['status'] = 'both'
G4seq_both_R = G4seq_K_R.loc[G4seq_K_R['PDS_distance_min'] <1].copy()
G4seq_both_R['status'] = 'both'

G4seq_F = pd.concat([G4seq_both_F, G4seq_K_F.loc[G4seq_K_F['PDS_distance_min'] >0], G4seq_PDS_F.loc[G4seq_PDS_F['K_distance_min'] >0]])
G4seq_R = pd.concat([G4seq_both_R, G4seq_K_R.loc[G4seq_K_R['PDS_distance_min'] >0], G4seq_PDS_R.loc[G4seq_PDS_R['K_distance_min'] >0]])

G4seq_F = G4seq_F[['chrom', 'start', 'end', 'score', 'status', 'Strand']]
G4seq_R = G4seq_R[['chrom', 'start', 'end', 'score', 'status', 'Strand']]

#### Confirm strand orientation and locate motifs <a name="DB_G4_motifs"></a>
[Return to Table of Contents](#TOC)

In [None]:
# Select chromosomes of interest
G4seq_F = G4seq_F.loc[G4seq_F['chrom'].isin(range(chr_range,23))].copy()
G4seq_R = G4seq_R.loc[G4seq_R['chrom'].isin(range(chr_range,23))].copy()

In [None]:
# Remove very long entries
G4seq_F['length'] = G4seq_F['end'] - G4seq_F['start']
G4seq_R['length'] = G4seq_R['end'] - G4seq_R['start']

G4seq_F = G4seq_F.loc[G4seq_F['length'] <1000]
G4seq_R = G4seq_R.loc[G4seq_R['length'] <1000]

In [None]:
# Retrieve sequence and confirm orientation (using base-1 start position)
G4seq_F['Sequence'] = [reference_genome[chrom][start-1:end].upper() for chrom, start, end in zip(G4seq_F['chrom'], G4seq_F['start'], G4seq_F['end'])]
G4seq_R['Sequence'] = [reference_genome[chrom][start-1:end].upper() for chrom, start, end in zip(G4seq_R['chrom'], G4seq_R['start'], G4seq_R['end'])]

G4seq_F['GGG_count'] = [seq.count('GGG') for seq in G4seq_F['Sequence']]
G4seq_R['GGG_count'] = [seq.count('GGG') for seq in G4seq_R['Sequence']]

G4seq_F['CCC_count'] = [seq.count('CCC') for seq in G4seq_F['Sequence']]
G4seq_R['CCC_count'] = [seq.count('CCC') for seq in G4seq_R['Sequence']]

for df, name in zip([G4seq_F, G4seq_R], ['G4seq_F', 'G4seq_R']):
    print(name + ' ' + 'GGG ' + str(round(len(df.loc[df['GGG_count'] >3]) / len(df),3)) + ' CCC ' + str(round(len(df.loc[df['CCC_count'] >3]) / len(df), 3)))

In [None]:
# Filter to somewhat standard G4 motifs (at least 4 occurrences of GGG on the forward strand, or CCC on the reverse strand)
G4seq_filtered_F = G4seq_F.loc[G4seq_F['GGG_count'] >3].reset_index(drop = True).copy()
G4seq_filtered_R = G4seq_R.loc[G4seq_R['CCC_count'] >3].reset_index(drop = True).copy()

In [None]:
# Trim to start and end with 'GGG' ('CCC' for the reverse strand), using base-0 start coordinates
G4seq_filtered_F['start'] = [start -1 + len(seq.split('GGG')[0]) for start, seq in zip(G4seq_filtered_F['start'], G4seq_filtered_F['Sequence'])]
G4seq_filtered_F['end'] = [end - len(seq[::-1].split('GGG')[0]) for end, seq in zip(G4seq_filtered_F['end'], G4seq_filtered_F['Sequence'])]

G4seq_filtered_R['start'] = [start -1 + len(seq.split('CCC')[0]) for start, seq in zip(G4seq_filtered_R['start'], G4seq_filtered_R['Sequence'])]
G4seq_filtered_R['end'] = [end - len(seq[::-1].split('CCC')[0]) for end, seq in zip(G4seq_filtered_R['end'], G4seq_filtered_R['Sequence'])]

# Annotate with sequence and motif length
G4seq_filtered_F['Sequence'] = [reference_genome[chrom][start:end].upper() for chrom, start, end in zip(G4seq_filtered_F['chrom'], G4seq_filtered_F['start'], G4seq_filtered_F['end'])]
G4seq_filtered_R['Sequence'] = [reference_genome[chrom][start:end].upper() for chrom, start, end in zip(G4seq_filtered_R['chrom'], G4seq_filtered_R['start'], G4seq_filtered_R['end'])]

G4seq_filtered_F['length'] = G4seq_filtered_F['Sequence'].str.len()
G4seq_filtered_R['length'] = G4seq_filtered_R['Sequence'].str.len()

G4seq_filtered_F = G4seq_filtered_F.sort_values(by = ['chrom', 'start', 'end', 'score'])
G4seq_filtered_F = G4seq_filtered_F.drop_duplicates(subset = ['chrom', 'start', 'end', 'status'], keep = 'last')

G4seq_filtered_R = G4seq_filtered_R.sort_values(by = ['chrom', 'start', 'end', 'score'])
G4seq_filtered_R = G4seq_filtered_R.drop_duplicates(subset = ['chrom', 'start', 'end', 'status'], keep = 'last')
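
The coordinate arithmetic in the trimming step can be checked on a short made-up sequence. The sketch below is illustrative only: `trim_to_motif`, the toy sequence, and its coordinates are assumptions introduced here, not part of the analysis. It converts a base-1 start to base-0 and removes flanking bases so that the interval starts and ends with the motif.

```python
def trim_to_motif(start_1based, end, seq, motif='GGG'):
    """Return base-0, end-exclusive coordinates trimmed to the first and last motif occurrence."""
    # Bases before the first motif occurrence
    left_flank = len(seq.split(motif)[0])
    # Bases after the last motif occurrence (search the reversed sequence)
    right_flank = len(seq[::-1].split(motif[::-1])[0])
    return start_1based - 1 + left_flank, end - right_flank

# Toy entry (made-up): 21 nt at base-1 positions 101..121
toy_seq = 'ATGGGTTGGGAGGGCTGGGAC'
toy_start, toy_end = 101, 121

new_start, new_end = trim_to_motif(toy_start, toy_end, toy_seq)
# Slice the toy sequence using offsets relative to its base-0 start (toy_start - 1)
trimmed = toy_seq[new_start - (toy_start - 1):new_end - (toy_start - 1)]
print(new_start, new_end, trimmed)
```

Here two bases are trimmed from each side, so the base-0 interval becomes 102 to 119 and the trimmed sequence both starts and ends with 'GGG'.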

In [None]:
for repeat, name in zip([G4seq_filtered_F, G4seq_filtered_R], ['G4seq_filtered_F', 'G4seq_filtered_R']):
    print(name + ' ' + str(len(repeat)))

#### Save G4 database <a name="DB_G4_save"></a>
[Return to Table of Contents](#TOC)

In [None]:
# Combine FWD and REV
G4seq_filtered = pd.concat([G4seq_filtered_F, G4seq_filtered_R])
G4seq_filtered['length'] = G4seq_filtered['length'].astype(int)
G4seq_filtered['GGG_count'] = G4seq_filtered['GGG_count'].astype(int)
G4seq_filtered['CCC_count'] = G4seq_filtered['CCC_count'].astype(int)
G4seq_filtered = G4seq_filtered.sort_values(by = ['chrom', 'start', 'end']).reset_index(drop = True)

In [None]:
# Save
G4seq_filtered.to_csv('./custom_db/G4seq_filtered_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', index = False)

## Combine all motifs into single database <a name="DB_all"></a>
[Return to Table of Contents](#TOC)

#### Load individual motif databases

In [None]:
# Load STR database
all_STRs = pd.read_csv('./custom_db/temp/STRs_custom_imperfect_1-9_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip')
all_STRs = all_STRs[['start', 'end', 'chrom', 'length', 'repeat', 'Sequence', 'Strand', 'status', 'repeat_frame_L']]
all_STRs['Type'] = 'STR'

In [None]:
# Load IR database
IR_expand_imperfect = pd.read_csv('./custom_db/inverted_repeats_withoutSTRs_imperfect_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip')
for col in ['L_start', 'R_start', 'chrom', 'stem_len', 'L_end', 'R_end', 'spacer', 'freq', '#MM']:
    IR_expand_imperfect[col] = IR_expand_imperfect[col].astype(int)

# Filter out direct/inverted/mirror multi-category repeats
IR_expand_imperfect['rev_seq_R'] = [seq[::-1] for seq in IR_expand_imperfect['seq_R']]
IR_MR = IR_expand_imperfect.loc[IR_expand_imperfect['seq_L'] == IR_expand_imperfect['rev_seq_R']]
IR_DR = IR_expand_imperfect.loc[IR_expand_imperfect['seq_L'] == IR_expand_imperfect['seq_R']]
IR_expand_imperfect = IR_expand_imperfect.loc[(IR_expand_imperfect['seq_L'] != IR_expand_imperfect['seq_R']) & (IR_expand_imperfect['seq_L'] != IR_expand_imperfect['rev_seq_R'])]

IR_expand_imperfect = IR_expand_imperfect[['chrom', 'L_start', 'R_end', 'L_end', 'R_start', 'stem_len', 'spacer', 'Sequence', 'seq_L', 'seq_R', 'RC_seq_R', '#MM']]
IR_expand_imperfect.columns = ['chrom', 'start', 'end', 'L_end', 'R_start', 'stem_len', 'spacer', 'Sequence', 'seq_L', 'seq_R', 'RC_seq_R', '#MM']
IR_expand_imperfect['Type'] = 'IR'

In [None]:
# Load MR database
MR_expand_imperfect = pd.read_csv('./custom_db/mirror_repeats_withoutSTRs_imperfect_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip')
for col in ['L_start', 'R_start', 'chrom', 'stem_len', 'L_end', 'R_end', 'spacer', 'freq', '#MM']:
    MR_expand_imperfect[col] = MR_expand_imperfect[col].astype(int)

# Filter out direct/inverted/mirror multi-category repeats
MR_expand_imperfect['RC_seq_R'] = [reverse_complement(seq) for seq in MR_expand_imperfect['seq_R']]
MR_IR = MR_expand_imperfect.loc[MR_expand_imperfect['seq_L'] == MR_expand_imperfect['RC_seq_R']]
MR_DR = MR_expand_imperfect.loc[MR_expand_imperfect['seq_L'] == MR_expand_imperfect['seq_R']]
MR_expand_imperfect = MR_expand_imperfect.loc[(MR_expand_imperfect['seq_L'] != MR_expand_imperfect['seq_R']) & (MR_expand_imperfect['seq_L'] != MR_expand_imperfect['RC_seq_R'])]

MR_expand_imperfect = MR_expand_imperfect[['chrom', 'L_start', 'R_end', 'L_end', 'R_start', 'stem_len', 'spacer', 'Sequence', 'seq_L', 'seq_R', 'rev_seq_R', '#MM']]
MR_expand_imperfect.columns = ['chrom', 'start', 'end', 'L_end', 'R_start', 'stem_len', 'spacer', 'Sequence', 'seq_L', 'seq_R', 'rev_seq_R', '#MM']
MR_expand_imperfect['Type'] = 'MR'

In [None]:
# Load DR database
DR_expand_imperfect = pd.read_csv('./custom_db/direct_repeats_withoutSTRs_imperfect_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip')
for col in ['L_start', 'R_start', 'chrom', 'stem_len', 'L_end', 'R_end', 'spacer', 'freq', '#MM']:
    DR_expand_imperfect[col] = DR_expand_imperfect[col].astype(int)

# Filter out direct/inverted/mirror multi-category repeats and duplicates
DR_expand_imperfect['rev_seq_R'] = [seq[::-1] for seq in DR_expand_imperfect['seq_R']]
DR_expand_imperfect['RC_seq_R'] = [reverse_complement(seq) for seq in DR_expand_imperfect['seq_R']]
DR_MR = DR_expand_imperfect.loc[DR_expand_imperfect['seq_L'] == DR_expand_imperfect['rev_seq_R']]
DR_IR = DR_expand_imperfect.loc[DR_expand_imperfect['seq_L'] == DR_expand_imperfect['RC_seq_R']]
DR_expand_imperfect = DR_expand_imperfect.loc[(DR_expand_imperfect['seq_L'] != DR_expand_imperfect['RC_seq_R']) & (DR_expand_imperfect['seq_L'] != DR_expand_imperfect['rev_seq_R'])]
del DR_expand_imperfect['rev_seq_R']; del DR_expand_imperfect['RC_seq_R']

# Exclude very long spacers
DR_expand_imperfect = DR_expand_imperfect.loc[DR_expand_imperfect['spacer'] <101]
# Drop duplicates, keeping the longest stem
DR_expand_imperfect = DR_expand_imperfect.sort_values(by = ['chrom', 'L_start', 'R_end', 'stem_len']).drop_duplicates(subset = ['chrom', 'L_start', 'R_end'], keep = 'last')

DR_expand_imperfect = DR_expand_imperfect[['chrom', 'L_start', 'R_end', 'L_end', 'R_start', 'stem_len', 'spacer', 'Sequence', 'seq_L', 'seq_R', '#MM']]
DR_expand_imperfect.columns = ['chrom', 'start', 'end', 'L_end', 'R_start', 'stem_len', 'spacer', 'Sequence', 'seq_L', 'seq_R', '#MM']
DR_expand_imperfect['Type'] = 'DR'

In [None]:
# Load Z-DNA database
Zmers_all = pd.read_csv('./custom_db/ZDNA_noSTRs_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip')
Zmers_all['Type'] = 'ZDNA'

In [None]:
# Load G4 database
G4seq_filtered = pd.read_csv('./custom_db/G4seq_filtered_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip')
G4seq_filtered.columns = ['chrom', 'start', 'end', 'G4seq_score', 'status', 'Strand', 'length', 'Sequence', 'GGG_count', 'CCC_count']
G4seq_filtered['Type'] = 'G4'

### Combine into single database, calculate distance to motifs and transposons, then filter database <a name="DB_all_distance"></a>

#### Distance measurement explanation:
- Note: Distances between coordinates are measured using two functions with slightly different behavior:

    - "distance_within_df()" function:
        - Compares each interval in dataframe A with every other interval in the same dataframe A.
        - 0 indicates overlap.
        - For nested overlaps (one completely inside the other), the smaller interval is labeled NaN and the larger interval ignores the smaller interval. This preserves the larger interval.

    - "measure_distance()" function:
        - Compares each interval in dataframe A with the intervals in a different dataframe B.
        - Takes non-overlapping intervals from dataframe B, and finds the nearest two intervals in B for each interval in dataframe A, based on start coordinates.
        - Behavior is slightly counterintuitive for certain overlap configurations, but overlaps of any type (including both pairs of a nested overlap) will have at least left or right distance < 0, which is then replaced by 0 for both left and right.
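
The convention that any overlap is reported as a distance of 0 can be illustrated with a small standalone sketch. `interval_gap` and `nearest_gap` below are hypothetical helpers, not the notebook's `measure_distance()` or `distance_within_df()`, and they do not reproduce the left/right bookkeeping or the NaN labeling described above. The sketch also assumes that touching intervals count as overlapping.

```python
def interval_gap(a_start, a_end, b_start, b_end):
    """Distance between two intervals; 0 when they overlap, nest, or touch."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))

def nearest_gap(query, others):
    """Smallest gap between one query interval and a list of (start, end) intervals."""
    return min(interval_gap(query[0], query[1], start, end) for start, end in others)

# Toy "dataframe B" intervals (made-up coordinates)
toy_B = [(10, 50), (120, 150), (200, 260)]

gap_partial = nearest_gap((100, 130), toy_B)   # partially overlaps (120, 150)
gap_nested = nearest_gap((125, 140), toy_B)    # nested inside (120, 150)
gap_separate = nearest_gap((160, 180), toy_B)  # 10 nt after (120, 150)
print(gap_partial, gap_nested, gap_separate)
```

Both the partial and the nested overlap give 0, while the separated query gives the gap to its closest neighbor.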

#### To generate unique database:
- Combine all individual motif types into one database prior to distance filtering.
- Search each individual motif type for overlaps against the STR database, and against the full database (excluding STRs and the current motif type) using "measure_distance()".
- Search for overlaps within each individual database, preserving longest nested overlaps, using "distance_within_df()".
- Combine individual motif types again, this time post-search, into one database.
- Filter for distance_min > 0 in every distance category (STR, nonSTR, and within_motif).
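
The final filtering step can be sketched on a toy table. The distances below are made up; only the column names follow the ones used later in this notebook. The example also shows two conventions at work: an infinite distance always passes the filter, and a NaN distance (the label given to the smaller interval of a nested overlap) never passes, because a comparison with NaN evaluates to False in pandas.

```python
import numpy as np
import pandas as pd

# Toy table with made-up distances
toy_motifs = pd.DataFrame({
    'Type': ['STR', 'IR', 'IR', 'G4'],
    'STR_distance_min': [np.inf, 0, 35, 12],
    'nonSTR_distance_min': [40, 18, 22, 9],
    'within_motif_distance_min': [15, 7, np.nan, 30],
})

distance_cols = ['STR_distance_min', 'nonSTR_distance_min', 'within_motif_distance_min']

# Keep only rows where every distance is strictly positive (NaN > 0 is False)
keep = (toy_motifs[distance_cols] > 0).all(axis=1)
toy_unique = toy_motifs.loc[keep]
print(toy_unique['Type'].tolist())  # ['STR', 'G4']
```

The first IR is dropped because it overlaps an STR, and the second IR is dropped because its within-motif distance is NaN.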

[Return to Table of Contents](#TOC)

In [None]:
all_motifs = pd.concat([all_STRs, IR_expand_imperfect, MR_expand_imperfect, DR_expand_imperfect, Zmers_all, G4seq_filtered])
all_motifs['length'] = all_motifs['end'] - all_motifs['start']

In [None]:
IR_expand_imperfect = measure_distance(IR_expand_imperfect, all_motifs.loc[all_motifs['Type'] == 'STR'], 'STR')
MR_expand_imperfect = measure_distance(MR_expand_imperfect, all_motifs.loc[all_motifs['Type'] == 'STR'], 'STR')
DR_expand_imperfect = measure_distance(DR_expand_imperfect, all_motifs.loc[all_motifs['Type'] == 'STR'], 'STR')
Zmers_all = measure_distance(Zmers_all, all_motifs.loc[all_motifs['Type'] == 'STR'], 'STR')

# Exclude very short STRs from G4 distance filtering, because physical G4 detection should supersede questionably short STR motif detection
G4seq_filtered = measure_distance(G4seq_filtered, all_motifs.loc[(all_motifs['Type'] == 'STR') & (all_motifs['length'] > 7)], 'STR')

# Set these values to infinity, so that they don't interfere with filtering based on distance_within_df
all_STRs['STR_distance_left'] = np.inf
all_STRs['STR_distance_right'] = np.inf
all_STRs['STR_distance_min'] = np.inf

In [None]:
all_STRs = measure_distance(all_STRs, all_motifs.loc[(all_motifs['Type'] != 'STR')], 'nonSTR')
IR_expand_imperfect = measure_distance(IR_expand_imperfect, all_motifs.loc[(all_motifs['Type'] != 'STR') & (all_motifs['Type'] != 'IR')], 'nonSTR')
MR_expand_imperfect = measure_distance(MR_expand_imperfect, all_motifs.loc[(all_motifs['Type'] != 'STR') & (all_motifs['Type'] != 'MR')], 'nonSTR')
DR_expand_imperfect = measure_distance(DR_expand_imperfect, all_motifs.loc[(all_motifs['Type'] != 'STR') & (all_motifs['Type'] != 'DR')], 'nonSTR')
Zmers_all = measure_distance(Zmers_all, all_motifs.loc[(all_motifs['Type'] != 'STR') & (all_motifs['Type'] != 'ZDNA')], 'nonSTR')
G4seq_filtered = measure_distance(G4seq_filtered, all_motifs.loc[(all_motifs['Type'] != 'STR') & (all_motifs['Type'] != 'G4')], 'nonSTR')

In [None]:
all_STRs = distance_within_df(all_STRs, 'within_motif')
IR_expand_imperfect = distance_within_df(IR_expand_imperfect, 'within_motif')
MR_expand_imperfect = distance_within_df(MR_expand_imperfect, 'within_motif')
DR_expand_imperfect = distance_within_df(DR_expand_imperfect, 'within_motif')
Zmers_all = distance_within_df(Zmers_all, 'within_motif')
G4seq_filtered = distance_within_df(G4seq_filtered, 'within_motif')

In [None]:
# Join/sort by priority in order to use drop_duplicates(keep = 'first') below
all_STRs = pd.concat([all_STRs.loc[all_STRs['status'] == 'perfect'], all_STRs.loc[all_STRs['status'] == 'inframe'], all_STRs.loc[all_STRs['status'] == 'shortindel'], all_STRs.loc[all_STRs['status'] == 'complex']])
all_motifs = pd.concat([all_STRs.sort_values(by = ['chrom', 'start', 'end']), G4seq_filtered.sort_values(by = ['chrom', 'start', 'end']), Zmers_all.sort_values(by = ['chrom', 'start', 'end']), IR_expand_imperfect.sort_values(by = ['chrom', 'start', 'end']), MR_expand_imperfect.sort_values(by = ['chrom', 'start', 'end']), DR_expand_imperfect.sort_values(by = ['chrom', 'start', 'end'])])

In [None]:
all_motifs['overall_distance_min'] = [min(nonSTR_distance, STR_distance, within_distance) for nonSTR_distance, STR_distance, within_distance in zip(all_motifs['nonSTR_distance_min'], all_motifs['STR_distance_min'], all_motifs['within_motif_distance_min'])]
all_motifs['length'] = all_motifs['end'] - all_motifs['start']

In [None]:
# Calculate distance between motifs and transposable elements
all_motifs = measure_distance(all_motifs, repeatmasker.loc[~repeatmasker['repClass'].isin(['Simple_repeat', 'Low_complexity'])], 'RM')
# Remove motifs that overlap Repeatmasker non-simple regions
# Only G4 motifs should be affected at this point, since all other motif types were found using the masked reference genome
all_motifs = all_motifs.loc[all_motifs['RM_distance_min'] > 0]

In [None]:
# Save database (CSV format is human-readable)
all_motifs.to_csv('./custom_db/all_motifs_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', index = False)
# Save database (Python-readable pickle file for easier loading)
all_motifs.to_pickle('./custom_db/all_motifs_chr'+str(chr_range)+'-22.pickle')

In [None]:
all_motifs = pd.read_csv('./custom_db/all_motifs_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip')

In [None]:
# Remove motifs that overlap with other motifs in the database
all_motifs_unique = all_motifs.loc[(all_motifs['STR_distance_min'] >0) & (all_motifs['nonSTR_distance_min'] >0) & (all_motifs['within_motif_distance_min'] >0)].copy()

# Remove any remaining fully-duplicated positions by priority (established above)
all_motifs_unique = all_motifs_unique.drop_duplicates(subset = ['chrom', 'start', 'end'], keep = 'first')

In [None]:
# Save database (CSV format is human-readable)
all_motifs_unique.to_csv('./custom_db/all_motifs_unique_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', index = False)
# Save database (Python-readable pickle file for easier loading)
all_motifs_unique.to_pickle('./custom_db/all_motifs_unique_chr'+str(chr_range)+'-22.pickle')

In [None]:
# Load motif database
all_motifs_unique = pd.read_pickle('./custom_db/all_motifs_unique_chr'+str(chr_range)+'-22.pickle')

### Repeat database stats

#### Plots for Supplementary Figure S1A and B <a name="DB_nonbdb_plot_S1A"></a>

[Return to Table of Contents](#TOC)

In [None]:
stem_len_count = all_motifs.groupby(['Type', 'stem_len']).count()['chrom'].unstack()
len_count = all_motifs.groupby(['Type', 'length']).count()['chrom'].unstack()

stem_len_count_unique = all_motifs_unique.groupby(['Type', 'stem_len']).count()['chrom'].unstack()
len_count_unique = all_motifs_unique.groupby(['Type', 'length']).count()['chrom'].unstack()

In [None]:
motif_type_colors = make_default_colors(['STR', 'IR', 'DR', 'MR', 'ZDNA', 'G4'], opacity = 0.75, last_black=False)

In [None]:
len_fig_unique = go.Figure()
for motif in ['STR', 'ZDNA', 'G4']:
    len_fig_unique.add_trace(go.Scatter(x = len_count.transpose().index, y = len_count.transpose()[motif], legendgroup = motif, mode = 'lines', line = dict(color = motif_type_colors[1][motif]), name = motif))
    len_fig_unique.add_trace(go.Scatter(x = len_count_unique.transpose().index, y = len_count_unique.transpose()[motif], legendgroup = motif, showlegend = False, mode = 'lines', line = dict(color = motif_type_colors[1][motif], dash = 'dot'), name = motif))
for motif in ['IR', 'DR', 'MR']:
    len_fig_unique.add_trace(go.Scatter(x = stem_len_count.transpose().index, y = stem_len_count.transpose()[motif], legendgroup = motif, mode = 'lines', line = dict(color = motif_type_colors[1][motif]), name = motif))
    len_fig_unique.add_trace(go.Scatter(x = stem_len_count_unique.transpose().index, y = stem_len_count_unique.transpose()[motif], legendgroup = motif, showlegend = False, mode = 'lines', line = dict(color = motif_type_colors[1][motif], dash = 'dot'), name = motif))
len_fig_unique.update_yaxes(type = 'log', dtick = 1, title = dict(text = '# Motifs', font = dict(size = 18)))
len_fig_unique.update_xaxes(type = 'log', title = dict(text = 'Motif length', font = dict(size = 18)))
len_fig_unique.update_layout(height = 400, width = 800, margin = dict(l = 55, r = 5, b = 40, t = 30))

len_fig_unique.show()

In [None]:
# Fig. S1A
len_fig_unique.write_image('./plots/revision_repeatdb_length_count_fig_S1a.png', format='png', scale = 10, engine = 'orca')

In [None]:
repeats_highpower = ['T', 'G', 'TG', 'TC', 'TGG', 'ATT', 'TTG', 'TTC', 'TCC', 'ATG', 'TGC', 'ATTT', 'A', 'C',  'AC', 'AG', 'ACC', 'AAT', 'AAC', 'AAG', 'AGG', 'ATC', 'AGC', 'AAAT', 'AT', 'GC']
# Symmetric and asymmetric STR motifs
repeats_highpower_asym = pd.Series(['T', 'G', 'TG', 'TC', 'TGG', 'ATT', 'TTG', 'TTC', 'TCC', 'ATG', 'TGC', 'ATTT'], index = ['A', 'C',  'AC', 'AG', 'ACC', 'AAT', 'AAC', 'AAG', 'AGG', 'ATC', 'AGC', 'AAAT'])
repeats_highpower_sym = ['AT', 'GC']
repeats_highpower_all = ['A', 'C',  'AC', 'AG', 'AT', 'GC', 'ACC', 'AAT', 'AAC', 'AAG', 'AGG', 'ATC', 'AGC', 'AAAT']
STR_colors = make_default_colors(repeats_highpower_all, 0.75, last_black=False)

In [None]:
STR_motifs_forcount_unique = all_motifs_unique.loc[(all_motifs_unique['Type'] == 'STR') & (all_motifs_unique['repeat'].isin(repeats_highpower))][['length', 'repeat', 'status']].copy()
STR_motifs_forcount_unique['repeat_rc'] = [reverse_complement(repeat) if repeat in list(repeats_highpower_asym) else repeat for repeat in STR_motifs_forcount_unique['repeat']]
STR_len_count_unique = STR_motifs_forcount_unique.groupby(['repeat_rc', 'status', 'length']).count()['repeat']

In [None]:
str_len_fig_unique = go.Figure()
for motif in repeats_highpower_all:
    str_len_fig_unique.add_trace(go.Scatter(x = STR_len_count_unique.loc[motif].loc['perfect'].index, y = STR_len_count_unique.loc[motif].loc['perfect'], mode = 'lines', line = dict(color = STR_colors[1][motif]), name = motif))
    str_len_fig_unique.add_trace(go.Scatter(x = STR_len_count_unique.loc[motif].loc['inframe'].index, y = STR_len_count_unique.loc[motif].loc['inframe'], showlegend = False, mode = 'lines', line = dict(color = STR_colors[1][motif], dash = 'dot'), name = motif))
str_len_fig_unique.update_yaxes(type = 'log', dtick = 1, title = dict(text = '# Motifs', font = dict(size = 18)))
str_len_fig_unique.update_xaxes(type = 'log', range = [0.7,2.2], title = dict(text = 'Motif length', font = dict(size = 18)))
str_len_fig_unique.update_layout(height = 400, width = 800, margin = dict(l = 55, r = 5, b = 40, t = 30))
str_len_fig_unique.show()

In [None]:
# Fig. S1B
str_len_fig_unique.write_image('./plots/revision_repeatdb_STR_length_count_fig_S1b.png', format='png', scale = 10, engine = 'orca')

#### Motif type overlaps  <a name="DB_all_overlaps"></a>
- Check whether motifs consistently overlap with motifs in other categories
- Plots demonstrate the importance and effectiveness of filtering method described above
- Same approach can be used to assess overlaps in "Non-B DB" (https://nonb-abcc.ncifcrf.gov/apps/site/default)
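
The overlap check described above amounts to labeling each motif with the set of motif types it overlaps, then tabulating how often each label occurs within a type. A minimal sketch on a toy table is shown here; the distances are made up and the per-type distance columns are assumed to have been computed already. In this toy table every motif has distance 0 to its own type, so a label containing a single type would mean no overlap with other categories.

```python
import pandas as pd

# Toy table with made-up per-type minimum distances
toy = pd.DataFrame({
    'Type': ['IR', 'IR', 'IR', 'G4'],
    'G4_distance_min': [30, 0, 8, 0],
    'IR_distance_min': [0, 0, 0, 5],
    'STR_distance_min': [0, 12, 0, 0],
})
toy_types = ['G4', 'IR', 'STR']

# Label each motif with every type found at distance < 1 (i.e. overlapping)
toy['motif_overlaps'] = [
    ' '.join(t for t in toy_types if row[t + '_distance_min'] < 1)
    for _, row in toy.iterrows()
]

# Fraction of IR motifs carrying each overlap label
ir_fractions = toy.loc[toy['Type'] == 'IR', 'motif_overlaps'].value_counts(normalize=True)
print(toy['motif_overlaps'].tolist())
print(ir_fractions)
```

In the toy data, two of the three IR entries are labeled 'IR STR' and one is labeled 'G4 IR', which is the kind of per-type breakdown that the stacked bar plots summarize.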

[Return to Table of Contents](#TOC)

In [None]:
all_motif_types = dict()
motif_types = ['DR', 'G4', 'IR', 'MR', 'STR', 'ZDNA']
for motif in motif_types:
    current_motif = all_motifs.loc[all_motifs['Type'] == motif].copy()
    current_motif['motif_overlaps'] = ''
    for other in motif_types:
        current_motif = measure_distance(current_motif, all_motifs.loc[all_motifs['Type'] == other], other)
        current_motif['motif_overlaps'] = current_motif['motif_overlaps'] + [other + ' ' if pos <1 else '' for pos in current_motif[other + '_distance_min']]
        del current_motif[other+'_distance_left']; del current_motif[other+'_distance_right']; del current_motif[other+'_distance_min']
        print('\r' + motif + ' ' + other, end='         ', flush = True)
    all_motif_types[motif] = current_motif

all_motifs = pd.concat(all_motif_types)
all_motifs = all_motifs.sort_values(by = ['chrom', 'start', 'end']).reset_index(drop = True)

all_motifs['motif_overlaps'] = [motifs_list[:-1] for motifs_list in all_motifs['motif_overlaps']]

# Remove motifs that overlap with other motifs in the database
all_motifs_unique = all_motifs.loc[(all_motifs['overall_distance_min'] >0)].copy()

In [None]:
# Before filtering - Compare to Figure S1C
groups_df = pd.DataFrame(index = set(all_motifs['motif_overlaps']))
for group in all_motif_types:
    groups_df[group] = all_motifs.loc[all_motifs['Type'] == group]['motif_overlaps'].value_counts() / len(all_motifs.loc[all_motifs['Type'] == group])

# group together entries that comprise only a small percentage of the category
groups_df_slim = pd.DataFrame(index = set(all_motifs['motif_overlaps']))
for group in all_motif_types:
    groups_df_slim[group] = groups_df[group].loc[groups_df[group] > 0.1]
groups_df_slim = groups_df_slim.dropna(how = 'all').fillna(0)
relevant_groups = list(groups_df_slim.index)
groups_df_slim = pd.DataFrame(index = relevant_groups)
for group in all_motif_types:
    groups_df_slim[group] = groups_df[group].loc[relevant_groups]
groups_df_slim = groups_df_slim.dropna(how = 'all').fillna(0)
groups_df_slim.loc['Other multiple'] = [1- groups_df_slim[col].sum() for col in groups_df_slim.columns]
groups_df_slim = groups_df_slim.loc[[group for group in groups_df_slim.index if len(group.split(' ')) > 1]]
groups_df_slim.loc['Unique'] = [1- groups_df_slim[col].sum() for col in groups_df_slim.columns]

groups_df_slim = groups_df_slim.loc[list(groups_df_slim.index[-1:]) + list(groups_df_slim.index[:-1])]
groups_df_slim = groups_df_slim[['IR', 'G4', 'STR', 'DR', 'MR', 'ZDNA']]
groups_df_slim_color = pd.Series(plotly.colors.DEFAULT_PLOTLY_COLORS[:len(groups_df_slim.index)-1] + ['rgb(220,220,220)'], index = groups_df_slim.index)

group_overlap_fig = go.Figure()
for types in groups_df_slim.index:
    group_overlap_fig.add_trace(go.Bar(name=types, x = groups_df_slim.columns, y = groups_df_slim.loc[types], text = types, marker = dict(color = groups_df_slim_color[types])))
group_overlap_fig.update_layout(barmode='stack')
group_overlap_fig.show()

In [None]:
# After filtering - Compare to Figure S1C
groups_df = pd.DataFrame(index = set(all_motifs_unique['motif_overlaps']))
for group in all_motif_types:
    groups_df[group] = all_motifs_unique.loc[all_motifs_unique['Type'] == group]['motif_overlaps'].value_counts() / len(all_motifs_unique.loc[all_motifs_unique['Type'] == group])

# group together entries that comprise only a small percentage of the category
groups_df_slim = pd.DataFrame(index = set(all_motifs_unique['motif_overlaps']))
for group in all_motif_types:
    groups_df_slim[group] = groups_df[group].loc[groups_df[group] > 0.15]
groups_df_slim = groups_df_slim.dropna(how = 'all').fillna(0)
relevant_groups = list(groups_df_slim.index)
groups_df_slim = pd.DataFrame(index = relevant_groups)
for group in all_motif_types:
    groups_df_slim[group] = groups_df[group].loc[relevant_groups]
groups_df_slim = groups_df_slim.dropna(how = 'all').fillna(0)
groups_df_slim.loc['Other multiple'] = [1- groups_df_slim[col].sum() for col in groups_df_slim.columns]
groups_df_slim = groups_df_slim.loc[[group for group in groups_df_slim.index if len(group.split(' ')) > 1]]
groups_df_slim.loc['Unique'] = [1- groups_df_slim[col].sum() for col in groups_df_slim.columns]

groups_df_slim = groups_df_slim.loc[list(groups_df_slim.index[-1:]) + list(groups_df_slim.index[:-1])]
groups_df_slim = groups_df_slim[['IR', 'G4', 'STR', 'DR', 'MR', 'ZDNA']]
groups_df_slim_color = pd.Series(plotly.colors.DEFAULT_PLOTLY_COLORS[:len(groups_df_slim.index)-1] + ['rgb(220,220,220)'], index = groups_df_slim.index)

group_overlap_fig = go.Figure()
for types in groups_df_slim.index:
    group_overlap_fig.add_trace(go.Bar(name=types, x = groups_df_slim.columns, y = groups_df_slim.loc[types], text = types, marker = dict(color = groups_df_slim_color[types])))
group_overlap_fig.update_layout(barmode='stack')
group_overlap_fig.show()

In [None]:
# Overall minimum distance to next motif, represented by percentile - Compare to Figure S1D
plot_values = all_motifs.drop_duplicates(subset = ['chrom', 'start', 'end'])['overall_distance_min'].fillna(0).quantile([i/100 for i in range(1,99)])
# Count all overlaps as 0
plot_values[plot_values < 0] = 0

distance_all_motifs_fig = go.Figure()
distance_all_motifs_fig.add_trace(go.Bar(x = list(range(1,99)), y = plot_values))
distance_all_motifs_fig.update_yaxes(title = 'distance to nearest motif (nt; 0 indicates overlap)')
distance_all_motifs_fig.update_xaxes(title = 'quantile')
distance_all_motifs_fig.show()

## Random sequences <a name="DB_random"></a>

[Return to Table of Contents](#TOC)

In [None]:
def generate_random_coordinates(prefilter_number):
    # Select random regions proportionally across all chromosomes
    # Use reference_genome_masked_STR to find random sequences that don't overlap with STRs or transposons
    # Pre-filter number is the initial number of random sequences across chromosomes 1-22, though many will be filtered out.
    count_prefilter = prefilter_number/chr_range
    reference_portion = pd.Series([len(reference_genome_masked_STR[chrom]) for chrom in range(chr_range,23)], index = range(chr_range,23)) / sum([len(reference_genome_masked_STR[chrom]) for chrom in range(chr_range,23)])
    num_to_sample = (reference_portion * count_prefilter).astype(int)

    import random
    random_coordinates = dict()
    for chrom in range(chr_range,23):
        random_coordinates[chrom] = pd.DataFrame()
        random_coordinates[chrom]['start'] = pd.Series(random.sample(range(chr_range,len(reference_genome_masked_STR[chrom])), num_to_sample[chrom]))
        random_coordinates[chrom]['end'] = random_coordinates[chrom]['start'] + 30
        random_coordinates[chrom]['chrom'] = chrom
    random_coordinates = pd.concat([random_coordinates[chrom] for chrom in random_coordinates])

    # Remove any sequences with Ns
    random_coordinates['Sequence'] = [reference_genome_masked_STR[chrom][start:end].upper() for chrom, start, end in zip(random_coordinates['chrom'], random_coordinates['start'], random_coordinates['end'])]
    random_coordinates['count_N'] = [seq.count('N') for seq in random_coordinates['Sequence']]
    random_coordinates = random_coordinates.loc[random_coordinates['count_N'] == 0]
    del random_coordinates['count_N']

    # Distance to nearest motif
    random_coordinates = measure_distance(random_coordinates, all_motifs.loc[all_motifs['Type'] == 'STR'], 'STR')
    random_coordinates = measure_distance(random_coordinates, all_motifs.loc[(all_motifs['Type'] != 'STR') & (all_motifs['Type'] != 'IR')], 'nonSTR')
    random_coordinates = distance_within_df(random_coordinates, 'within_motif')

    # Distance to nearest Repeatmasker entry (excluding simple and low complexity)
    random_coordinates = measure_distance(random_coordinates, repeatmasker.loc[~repeatmasker['repClass'].isin(['Simple_repeat', 'Low_complexity'])], 'RM')

    # Filter out sequences with nearby motifs/transposons
    RM_filter_distance = 5; nonSTR_filter_distance = 5; STR_filter_distance = 20; within_motif_filter_distance = 20

    random_filtered = random_coordinates.loc[(random_coordinates['RM_distance_min'] >= RM_filter_distance) & (random_coordinates['STR_distance_min'] >= STR_filter_distance) & (random_coordinates['nonSTR_distance_min'] >= nonSTR_filter_distance)  & (random_coordinates['within_motif_distance_min'] >= within_motif_filter_distance)]
    
    return random_filtered
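
The proportional allocation at the top of `generate_random_coordinates` can be illustrated on toy data. This is a minimal sketch only: the chromosome lengths, `toy_lengths`, and `total_to_sample` are made-up values for illustration and are not taken from hg38 or from the analysis.

```python
import random
import pandas as pd

# Hypothetical chromosome lengths (illustrative only, not hg38 values)
toy_lengths = pd.Series({20: 600, 21: 300, 22: 100})
total_to_sample = 50

# Each chromosome receives a share of the samples proportional to its length
portion = toy_lengths / toy_lengths.sum()
num_to_sample = (portion * total_to_sample).astype(int)

random.seed(0)  # fixed seed so the toy example is repeatable
toy_coordinates = []
for chrom, n in num_to_sample.items():
    # random.sample draws without replacement, so start positions are unique per chromosome
    starts = random.sample(range(int(toy_lengths[chrom])), int(n))
    toy_coordinates.append(pd.DataFrame({'chrom': chrom, 'start': starts}))
toy_coordinates = pd.concat(toy_coordinates, ignore_index=True)

# Fixed 30-nt windows, as in the function above
toy_coordinates['end'] = toy_coordinates['start'] + 30
```

Because `astype(int)` truncates, the total number of sampled rows can fall slightly below the requested number.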

In [None]:
# Generate two independent sets of random coordinates
random_set1 = generate_random_coordinates(500000)
random_set2 = generate_random_coordinates(500000)

# Save database (for repeatability)
random_set1.to_csv('./custom_db/random_sequences_set1_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', index = False)
random_set2.to_csv('./custom_db/random_sequences_set2_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', index = False)

In [None]:
# Generate two (larger) independent sets of random coordinates
random_set3 = generate_random_coordinates(5000000)
random_set4 = generate_random_coordinates(5000000)

# Save database (for repeatability)
random_set3.to_csv('./custom_db/random_sequences_set3_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', index = False)
random_set4.to_csv('./custom_db/random_sequences_set4_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', index = False)

## Non-B DB <a name="DB_nonbdb"></a>
- For comparison purposes
- Download at https://nonb-abcc.ncifcrf.gov/apps/nBMST/default/
- Choose hg38 version: "human_hg38.tsv.tar.gz"
- Save to directory "./nonbdb/"

### Modify and annotate Non-B DB <a name="DB_nonbdb_modify"></a>

#### Load and format Non-B Database <a name="DB_nonbdb_load"></a>

[Return to Table of Contents](#TOC)

In [None]:
# Load Non-B DB database file
nonbdb = pd.read_csv('./nonbdb/human_hg38.tsv.tar.gz', sep = '\t', low_memory = False)

# Format database
nonbdb = nonbdb.dropna(how = 'all', axis=0)
del nonbdb['Source']; del nonbdb['Strand']
nonbdb.columns = ['chrom', 'Type', 'start', 'end', 'length', 'Score', 'repeat', 'Spacer', 'Tracts', 'Subset', 'Composition', 'Sequence']
# Filter to desired motifs
nonbdb = nonbdb.loc[nonbdb['Type'].isin(['Direct_Repeat', 'Inverted_Repeat', 'Mirror_Repeat', 'Short_Tandem_Repeat', 'Z_DNA_Motif'])].copy()
# Filter to chromosomes 1-22
nonbdb['chrom'] = [chrom[3:] for chrom in nonbdb['chrom']]
nonbdb = nonbdb.loc[nonbdb['chrom'].str.isnumeric()].copy()
nonbdb['chrom'] = nonbdb['chrom'].astype(int)
# Filter to chrom range for testing purposes
nonbdb = nonbdb.loc[nonbdb['chrom'].isin(list(range(chr_range, 23)))].copy()
# Change start coordinates to base-0
nonbdb['start'] = nonbdb['start'].astype(int) -1
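
Subtracting 1 from `start` turns a 1-based, inclusive interval into the 0-based, half-open interval that Python slicing expects. The sketch below assumes the input coordinates are 1-based and inclusive, as the comment above implies; `toy_sequence` and the interval values are made up for illustration.

```python
toy_sequence = 'AACCGGTT'

# A 1-based, inclusive interval covering the 3rd through 6th bases
start_1based, end_1based = 3, 6

# Convert to 0-based, half-open: only the start shifts, the end is unchanged
start_0based = start_1based - 1
end_0based = end_1based

# Slicing and length now work without any further +1/-1 adjustments
motif = toy_sequence[start_0based:end_0based]
motif_length = end_0based - start_0based
```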

#### Add in G4 motifs from G4-seq data <a name="DB_nonbdb_G4"></a>
- Replaces G4 motifs in Non-B DB with G4 motifs from G4-seq, as annotated by this study

[Return to Table of Contents](#TOC)

In [None]:
# Load file from above G4 section
G4seq_filtered = pd.read_csv('./custom_db/G4seq_filtered_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', usecols = ['chrom', 'start', 'end', 'score', 'status', 'Strand', 'length', 'Sequence'])
# Format
G4seq_filtered.columns = ['chrom', 'start', 'end', 'Score', 'status', 'Strand', 'length', 'Sequence']
G4seq_filtered['Type'] = 'G4'
# Filter to K+ condition
G4seq_filtered = G4seq_filtered.loc[G4seq_filtered['status'] != 'PDS'].copy()
del G4seq_filtered['status']

In [None]:
# Combine
nonbdb = pd.concat([nonbdb, G4seq_filtered])
# Format
for col in ['start', 'end', 'length']:
    nonbdb[col] = nonbdb[col].astype(int)
for col in ['Score', 'repeat', 'Spacer']:
    nonbdb[col] = nonbdb[col].astype(float)

#### Calculate proximity to other motifs

In [None]:
# Calculate distance to nearest nonBdb motif
nonbdb = distance_within_df(nonbdb, 'nonbdb')

#### Add in random sequences <a name="DB_nonbdb_random"></a>

[Return to Table of Contents](#TOC)

In [None]:
# Generate random sequences without filtering for proximity to motifs or transposons
def generate_random_coordinates_nofilter():
    # Select random regions proportionally across all chromosomes
    # Use the unmasked reference_genome, so random sequences may overlap with STRs or transposons
    # Pre-filter number is 500,000 across chromosomes 1-22, though sequences containing Ns will be removed.
    count_prefilter = 500000/chr_range
    reference_portion = pd.Series([len(reference_genome[chrom]) for chrom in range(chr_range,23)], index = range(chr_range,23)) / sum([len(reference_genome[chrom]) for chrom in range(chr_range,23)])
    num_to_sample = (reference_portion * count_prefilter).astype(int)

    import random
    random_coordinates = dict()
    for chrom in range(chr_range,23):
        random_coordinates[chrom] = pd.DataFrame()
        random_coordinates[chrom]['start'] = pd.Series(random.sample(range(chr_range,len(reference_genome[chrom])), num_to_sample[chrom]))
        random_coordinates[chrom]['end'] = random_coordinates[chrom]['start'] + 30
        random_coordinates[chrom]['chrom'] = chrom
    random_coordinates = pd.concat([random_coordinates[chrom] for chrom in random_coordinates])

    # Remove any sequences with Ns
    random_coordinates['Sequence'] = [reference_genome[chrom][start:end].upper() for chrom, start, end in zip(random_coordinates['chrom'], random_coordinates['start'], random_coordinates['end'])]
    random_coordinates['count_N'] = [seq.count('N') for seq in random_coordinates['Sequence']]
    random_coordinates = random_coordinates.loc[random_coordinates['count_N'] == 0]
    del random_coordinates['count_N']
    
    random_coordinates['length'] = random_coordinates['end'] - random_coordinates['start']

    return random_coordinates

In [None]:
# Generate sequences
random_nofilter = generate_random_coordinates_nofilter()
# Measure distance to NonB-DB motifs
random_nofilter = measure_distance(random_nofilter, nonbdb, 'nonbdb')
random_nofilter['Type'] = 'random'

In [None]:
# Add random sequences to NonB-DB
nonbdb = pd.concat([nonbdb, random_nofilter])

#### Filter database by proximity to motifs and transposons (or don't) <a name="DB_nonbdb_filter"></a>

[Return to Table of Contents](#TOC)

In [None]:
# Calculate distance between motifs and transposable elements
nonbdb = measure_distance(nonbdb, repeatmasker.loc[~repeatmasker['repClass'].isin(['Simple_repeat', 'Low_complexity'])], 'RM')

In [None]:
# Save modified NonB-DB
nonbdb.to_csv('./nonbdb/nonbdb_modified_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', index = False)

In [None]:
# Load
nonbdb = pd.read_csv('./nonbdb/nonbdb_modified_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', low_memory = False)

In [None]:
# Filter for transposon distance
nonbdb_filtered = nonbdb.loc[(nonbdb['RM_distance_min'] > 500)]
# Filter for overlaps with Non-B motifs
nonbdb_unique = nonbdb_filtered.loc[(nonbdb_filtered['nonbdb_distance_min'] > 0)]
# Basic filter for removing motifs with neighboring motifs
nonbdb_distant = nonbdb_filtered.loc[(nonbdb_filtered['nonbdb_distance_min'] > 250)]

In [None]:
repeat_counts = pd.DataFrame()
repeat_counts['all'] = nonbdb['Type'].value_counts()
repeat_counts['filtered'] = nonbdb_filtered['Type'].value_counts()
repeat_counts['unique'] = nonbdb_unique['Type'].value_counts()
repeat_counts['distant'] = nonbdb_distant['Type'].value_counts()
repeat_counts

### Plot overlaps between Non-B DB categories <a name="DB_nonbdb_plot"></a>

[Return to Table of Contents](#TOC)

In [None]:
all_motif_types = dict()
motif_types = ['Direct_Repeat', 'G4', 'Inverted_Repeat', 'Mirror_Repeat', 'Short_Tandem_Repeat', 'Z_DNA_Motif']
for motif in motif_types:
    current_motif = nonbdb.loc[nonbdb['Type'] == motif].copy()
    current_motif['motif_overlaps'] = ''
    for other in motif_types:
        current_motif = measure_distance(current_motif, nonbdb.loc[nonbdb['Type'] == other], other)
        current_motif['motif_overlaps'] = current_motif['motif_overlaps'] + [other + ' ' if pos <1 else '' for pos in current_motif[other + '_distance_min']]
        del current_motif[other+'_distance_left']; del current_motif[other+'_distance_right']; del current_motif[other+'_distance_min']
        print('\r' + motif + ' ' + other, end='         ', flush = True)
    all_motif_types[motif] = current_motif

nonbdb = pd.concat(all_motif_types)
nonbdb = nonbdb.sort_values(by = ['chrom', 'start', 'end']).reset_index(drop = True)

nonbdb['motif_overlaps'] = [motifs_list[:-1] for motifs_list in nonbdb['motif_overlaps']]

In [None]:
# Before filtering
groups_df = pd.DataFrame(index = set(nonbdb['motif_overlaps']))
for group in all_motif_types:
    groups_df[group] = nonbdb.loc[nonbdb['Type'] == group]['motif_overlaps'].value_counts() / len(nonbdb.loc[nonbdb['Type'] == group])

# group together entries that comprise only a small percentage of the category
groups_df_slim = pd.DataFrame(index = set(nonbdb['motif_overlaps']))
for group in all_motif_types:
    groups_df_slim[group] = groups_df[group].loc[groups_df[group] > 0.075]
groups_df_slim = groups_df_slim.dropna(how = 'all').fillna(0)
relevant_groups = list(groups_df_slim.index)
groups_df_slim = pd.DataFrame(index = relevant_groups)
for group in all_motif_types:
    groups_df_slim[group] = groups_df[group].loc[relevant_groups]
groups_df_slim = groups_df_slim.dropna(how = 'all').fillna(0)
groups_df_slim.loc['Other multiple'] = [1- groups_df_slim[col].sum() for col in groups_df_slim.columns]
groups_df_slim = groups_df_slim.loc[[group for group in groups_df_slim.index if len(group.split(' ')) > 1]]
groups_df_slim.loc['Unique'] = [1- groups_df_slim[col].sum() for col in groups_df_slim.columns]

groups_df_slim_names = [names.replace('Inverted_Repeat', 'IR').replace('Direct_Repeat', 'DR').replace('Mirror_Repeat', 'MR').replace('Short_Tandem_Repeat', 'STR').replace('Z_DNA_Motif', 'ZDNA') for names in groups_df_slim.index]
groups_df_slim.index = groups_df_slim_names
groups_df_slim = groups_df_slim.loc[list(groups_df_slim.index[-1:]) + list(groups_df_slim.index[:-1])]

groups_df_slim_colnames = [names.replace('Inverted_Repeat', 'IR').replace('Direct_Repeat', 'DR').replace('Mirror_Repeat', 'MR').replace('Short_Tandem_Repeat', 'STR').replace('Z_DNA_Motif', 'ZDNA') for names in groups_df_slim.columns]
groups_df_slim.columns = groups_df_slim_colnames
groups_df_slim = groups_df_slim[['IR', 'G4', 'STR', 'DR', 'MR', 'ZDNA']]

groups_df_slim_color = pd.Series(plotly.colors.DEFAULT_PLOTLY_COLORS[:len(groups_df_slim.index)-1] + ['rgb(220,220,220)'], index = groups_df_slim.index)

#### Plot for Supplementary Figure S1c <a name="DB_nonbdb_plot_S1C"></a>

[Return to Table of Contents](#TOC)

In [None]:
group_overlap_fig_S1c = go.Figure()
for types in groups_df_slim.index:
    group_overlap_fig_S1c.add_trace(go.Bar(name=types, x = groups_df_slim.columns, y = groups_df_slim.loc[types], marker = dict(color = groups_df_slim_color[types])))
group_overlap_fig_S1c.update_layout(barmode='stack')

group_overlap_fig_S1c.update_xaxes(tickfont  = dict(size = 14))
group_overlap_fig_S1c.update_yaxes(tickfont  = dict(size = 14))
group_overlap_fig_S1c.update_yaxes(title = dict(text = 'Portion of category', font = dict(size = 18), standoff = 0))
group_overlap_fig_S1c.update_layout(height = 300, width = 600, margin = dict(l = 55, r = 5, b = 20, t = 20))

group_overlap_fig_S1c.show()

In [None]:
group_overlap_fig_S1c.write_image('./plots/revision_nonbdb_category_overlap_fig_S1c.png', format='png', scale = 10, engine = 'orca')

#### Plot for Supplementary Figure S1d <a name="DB_nonbdb_plot_S1D"></a>

[Return to Table of Contents](#TOC)

In [None]:
# Minimum distance to next Non-B DB motif, represented by percentile
plot_values = nonbdb.drop_duplicates(subset = ['chrom', 'start', 'end'])['nonbdb_distance_min'].fillna(0).quantile([i/100 for i in range(1,99)])
# Count all overlaps as 0
plot_values[plot_values < 0] = 0
# Minimum distance to next Non-B DB motif or Repeatmasker transposon, represented by percentile
plot_values_rm = nonbdb.drop_duplicates(subset = ['chrom', 'start', 'end'])[['nonbdb_distance_min', 'RM_distance_min']].min(axis=1).fillna(0).quantile([i/100 for i in range(1,99)])
# Count all overlaps as 0
plot_values_rm[plot_values_rm < 0] = 0

In [None]:
distance_nonbdb_fig_S1d = go.Figure()
distance_nonbdb_fig_S1d.add_trace(go.Scatter(x = list(range(1,99)), y = plot_values, name = 'NonB', opacity = 0.75))
distance_nonbdb_fig_S1d.add_trace(go.Scatter(x = list(range(1,99)), y = plot_values_rm, name = 'NonB and Repeatmasker', opacity = 0.75))
distance_nonbdb_fig_S1d.update_yaxes(range = [-20, 575], tickfont = dict(size = 14), title = dict(text = 'Distance to nearest motif (nt)', font = dict(size = 16), standoff = 0))
distance_nonbdb_fig_S1d.update_xaxes(tickfont = dict(size = 14), title = dict(text = 'Percentile', font = dict(size = 18), standoff = 0))
distance_nonbdb_fig_S1d.update_layout(height = 300, width = 400, margin = dict(l = 55, r = 5, b = 35, t = 30), legend = dict(x = 0.05, y = 1.1, orientation = 'h', font = dict(size = 14)))

distance_nonbdb_fig_S1d.show()

In [None]:
distance_nonbdb_fig_S1d.write_image('./plots/revision_nonbdb_overlap_fig_S1d.png', format='png', scale = 10, engine = 'orca')

# Prepare gnomAD database  <a name="GNOMAD"></a>
Download from https://gnomad.broadinstitute.org/downloads
- WARNING: files are extremely large
- Select gnomAD V3
- Download individual chromosome files
- Chr22 example: https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.1/vcf/genomes/gnomad.genomes.v3.1.1.sites.chr22.vcf.bgz
- Place in directory './gnomad/temp/'

[Return to Table of Contents](#TOC)

## Shrink gnomAD files by removing unneeded information <a name="GNOMAD_shrink"></a>

#### Read gnomAD file in chunks, and output each chunk to a temporary file <a name="GNOMAD_shrink_read"></a>
- This is a very slow and inefficient step, but necessary in order to use the freely available gnomAD files.
    - ~8 hours for Chr22
- Different chromosomes can be run in parallel, depending on CPU and RAM limitations:
    - Copy the following notebook cell into another notebook.
    - Replace the line "for chrom in range(chr_range,23):" with "chrom = 22" and change the chromosome number each time.
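
The cell below pulls individual fields out of the VCF `INFO` column with nested string splits. A minimal sketch of that idea on two made-up `INFO` strings (the values are illustrative and are not real gnomAD records):

```python
import pandas as pd

# Two made-up INFO strings in VCF "key=value;key=value" style
toy = pd.DataFrame({'INFO': ['AC=3;AN=152000;QD=12.5', 'AC=1;AN=151000;QD=.']})

for key in ['AC', 'AN', 'QD']:
    # The text after "key=" and before the next ";" is the value for that key
    toy[key] = toy['INFO'].str.split(key + '=', expand=True)[1].str.split(';', expand=True)[0]

toy['AC'] = toy['AC'].astype(int)
toy['AN'] = toy['AN'].astype(int)
# Missing values are written as "."; errors='coerce' turns them into NaN
# (the notebook cell uses replace('.', np.nan) for the same purpose)
toy['QD'] = pd.to_numeric(toy['QD'], errors='coerce')
```

Splitting on `key + '='` matches the first occurrence of that substring, so this approach relies on no earlier `INFO` key ending with the same characters.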
    
[Return to Table of Contents](#TOC)

In [None]:
for chrom in range(chr_range,23):

    # Can lower chunk size to reduce memory usage.
    chunk_size = 50000

    download_path = ('./gnomad/temp/')
    file_list = [file for file in os.listdir(download_path) if file.endswith('.vcf.bgz') == True]
    pickle_list = [file for file in os.listdir(download_path) if file.endswith('chr'+str(chrom)+'.pickle') == True]
    skip = 943  # Number of leading lines of the VCF file to skip (passed to skiprows below)
    n = 0
    time_start = time()
    file = 'gnomad.genomes.v3.1.1.sites.chr'+str(chrom)+'.vcf.bgz'

    reader = pd.read_csv(download_path+file, sep = '\t', skiprows = skip, header = None, usecols = [1,3,4,6,7], compression = 'gzip', chunksize = chunk_size)
    for chunk in reader:
        if n >= len(pickle_list):
            snps = pd.DataFrame(chunk)
            snps.columns = ['POS', 'REF', 'ALT', 'FILTER', 'INFO']
            for infotype in ['AC', 'AN', 'n_alt_alleles', 'nhomalt', 'variant_type', 'DP', 'VarDP', 'InbreedingCoeff', 'MQ', 'MQRankSum', 'ReadPosRankSum', 'QD', 'AS_VQSLOD', 'AS_FS', 'AS_MQ', 'AS_MQRankSum', 'AS_pab_max', 'AS_QD', 'AS_ReadPosRankSum', 'AS_SOR']:
                snps[infotype] = snps['INFO'].str.split(infotype+'=', expand = True)[1].str.split(';', expand = True)[0]
            for infotype in ['AC', 'AN', 'n_alt_alleles', 'nhomalt', 'DP', 'VarDP']:
                snps[infotype] = snps[infotype].astype(int)
            for infotype in ['MQ', 'QD', 'AS_VQSLOD', 'InbreedingCoeff', 'MQRankSum', 'ReadPosRankSum', 'AS_FS', 'AS_MQ', 'AS_MQRankSum', 'AS_pab_max', 'AS_QD', 'AS_ReadPosRankSum', 'AS_SOR']:
                snps[infotype] = snps[infotype].replace('.', np.nan).astype(float)
            snps = snps.drop(['INFO'], axis=1)
            snps = snps.loc[(snps['AC'] > 0)]
            n+=1
            print('chr' + str(chrom) + '_' + str(n), end="\r", flush=True)
            snps.to_pickle(download_path+'chunk_'+str(n)+'_chr'+str(chrom)+'.pickle')
        else:
            print('chr' + str(chrom) + '_skip_' + str(n), end="\r", flush=True)
            n+=1

    time_end = time()
    print(str(chrom) + ' ' + str((time_end - time_start) / 3600) + ' hours')

#### Combine chunks into one .csv.gz file per chromosome <a name="GNOMAD_shrink_combine"></a>
[Return to Table of Contents](#TOC)

In [None]:
for chrom in range(chr_range,23):
    pickle_list = [file for file in os.listdir(download_path) if file.endswith('chr'+str(chrom)+'.pickle') == True]

    all_variants = dict()
    n=0
    for pick in pickle_list:
        n+=1
        all_variants[n] = pd.read_pickle(download_path+pick)
        print('chr' + str(chrom) + ' ' + str(n) + ' / '+str(len(pickle_list)) + '          ', end="\r", flush=True)

    all_variants_df = pd.concat(all_variants).reset_index(drop = True)
    all_variants_df = all_variants_df.drop_duplicates(keep = 'first')

    all_variants_df.to_csv('./gnomad/all_variants_chr'+str(chrom)+'.csv.gz', index = False, compression = 'gzip')

#### Filter to rare SNVs and format files for mutation counting <a name="GNOMAD_rare"></a>
- also calculate genome GC correction factors at the same time (for efficiency)
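
The local GC content is computed with a centered rolling mean over the reference sequence. A minimal sketch on a made-up sequence, using a 5-nt window instead of the 51-nt window used in the function below so that the result is easy to check by hand (`toy_sequence` and `window` are illustrative assumptions):

```python
import pandas as pd

toy_sequence = 'GGCCATATNAGC'  # made-up sequence, not from hg38
window = 5                     # small window for readability

toy_gc = pd.DataFrame(list(toy_sequence), columns=['seq'])
# Cast the boolean flags to float so the rolling mean is a plain numeric operation
toy_gc['GC'] = toy_gc['seq'].isin(['G', 'C']).astype(float)
toy_gc['N'] = (toy_gc['seq'] == 'N').astype(float)

# Centered rolling mean: fraction of G/C (or N) in the window around each base.
# The first and last window // 2 positions have incomplete windows and are NaN.
toy_gc['GC_win'] = toy_gc['GC'].rolling(window, center=True).mean()
toy_gc['N_win'] = toy_gc['N'].rolling(window, center=True).mean()

# Keep only positions whose window contains no N
toy_clean = toy_gc.loc[toy_gc['N_win'] == 0]
```

Positions near the sequence ends drop out as well, because `NaN == 0` evaluates to false.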

[Return to Table of Contents](#TOC)

In [None]:
def slim_gnomad_files_gc_content(chrom):
    gnomad_current = pd.read_csv('./gnomad/all_variants_chr'+str(chrom)+'.csv.gz', compression = 'gzip', usecols = ['POS', 'REF', 'ALT', 'AC', 'AN', 'AS_VQSLOD', 'InbreedingCoeff'])
    print('processing chr' + str(chrom) + '      ', end="\r", flush=True)

    # Select only rare variants
    gnomad_current = gnomad_current.loc[gnomad_current['AC'] / gnomad_current['AN'] <  10**-4].copy()

    # SNVs only
    gnomad_current['ref_len'] = [len(ref) for ref in gnomad_current['REF']]
    gnomad_current['alt_len'] = [len(alt) for alt in gnomad_current['ALT']]
    gnomad_current = gnomad_current.loc[(gnomad_current['ref_len'] == 1) & (gnomad_current['alt_len'] == 1)].copy()
    del gnomad_current['ref_len']; del gnomad_current['alt_len']

    # One line per position, keeping AC and AS_VQSLOD quality scores per mutation type
    gnomad_byalt = dict()
    for alt in ['A', 'T', 'G', 'C']:
        gnomad_byalt[alt] = gnomad_current.loc[gnomad_current['ALT'] == alt].copy()
    gnomad_slim = pd.concat([gnomad_byalt[alt][['POS', 'AC']].set_index('POS') for alt in ['A', 'T', 'G', 'C']], axis=1).fillna(0).astype(int)
    gnomad_slim.columns = ['A', 'T', 'G', 'C']
    gnomad_slim = gnomad_slim.sort_index()
    gnomad_slim['Tri'] = [tri_function(chrom,pos) for pos in gnomad_slim.index]
    for alt in ['A', 'T', 'G', 'C']:
        gnomad_slim['qual_'+alt] = gnomad_byalt[alt][['POS', 'AS_VQSLOD']].set_index('POS')
        gnomad_slim['inbr_'+alt] = gnomad_byalt[alt][['POS', 'InbreedingCoeff']].set_index('POS')
        
    # Calculate GC content in a 51-nt window centered on each SNV position
    nmer = 51
    current_gc = pd.DataFrame(list(reference_genome[chrom]), columns = ['seq'])
    current_gc['GC'] = current_gc['seq'].isin(['G', 'C'])
    current_gc['N'] = current_gc['seq'] == 'N'
    current_gc['GC_'+str(nmer)] = current_gc['GC'].rolling(nmer, center = True).mean()
    current_gc['N_'+str(nmer)] = current_gc['N'].rolling(nmer, center = True).mean()
    chrom_nucleotide_content = pd.DataFrame(current_gc.loc[current_gc['N_'+str(nmer)] == False]['GC_'+str(nmer)]).round(3)
    chrom_nucleotide_content.index = chrom_nucleotide_content.index +1
    gnomad_slim['GC_'+str(nmer)] = chrom_nucleotide_content['GC_'+str(nmer)].reindex(gnomad_slim.index)
    chrom_nucleotide_content['Tri'] = [tri_function(chrom, pos) for pos in chrom_nucleotide_content.index]
    chrom_nucleotide_content['count'] = 1
    
    print('saving chr' + str(chrom) + '        ', end="\r", flush=True)
    gnomad_slim.to_csv('./gnomad/gnomad_chr'+str(chrom)+'_lowN_AC-tri-qual.csv.gz', compression = 'gzip')
    print('finished chr' + str(chrom) + '      ', end="\r", flush=True)

    return triplet_combine_RC(chrom_nucleotide_content.groupby(['GC_'+str(nmer), 'Tri']).count()['count'].unstack().transpose(), mut_output=True)

In [None]:
genome_nucleotide_count = dict()
for chrom in range(chr_range, 23):
    genome_nucleotide_count[chrom] = slim_gnomad_files_gc_content(chrom)
    print('finished chr' + str(chrom) + '      ', end="\r", flush=True)

In [None]:
genome_nucleotide_count_all = pd.concat(genome_nucleotide_count).groupby(pd.concat(genome_nucleotide_count).index.get_level_values(1)).sum().astype(int)
combined_nucleotide_count = triplet_combine_RC(genome_nucleotide_count_all, mut_input=True).astype(int)
combined_nucleotide_count.to_csv('./hg38/hg38_triplet_counts_by_GCwindow_nmer51_chr'+str(chr_range)+'-22.csv')

### Load gnomAD SNV database and calculate trinucleotide mutation frequency <a name="GNOMAD_freq"></a>
[Return to Table of Contents](#TOC)

In [None]:
# Load SNV database
gnomad_slim_all = dict()
for chrom in range(chr_range, 23):
    gnomad_slim_all[chrom] = pd.read_csv('./gnomad/gnomad_chr'+str(chrom)+'_lowN_AC-tri-qual.csv.gz', compression = 'gzip')
    gnomad_slim_all[chrom].set_index('POS', inplace = True)
    print('loaded chr' + str(chrom) + '      ', end="\r", flush=True)

combined_nucleotide_count = pd.read_csv('./hg38/hg38_triplet_counts_by_GCwindow_nmer51_chr'+str(chr_range)+'-22.csv', index_col = 0)

In [None]:
# Number of genomes in the gnomAD 3.1 database
gnomad_n_genomes = 76156

# List of VQSLOD and InbreedingCoefficient values to use as QC filters
#(the first set of values is for no filtration, and the second set corresponds to gnomAD's recommended passing score)
qc_cutoff_list = [(-np.inf, -np.inf), (-2.774, -0.3), (0, -0.3), (4, -0.3)]

#### Generate downsampled gnomAD files based on AC  <a name="GNOMAD_downsample_generate"></a>
[Return to Table of Contents](#TOC)

In [None]:
def downsample_gnomad_files(chrom):

    gnomad_current = pd.read_csv('./gnomad/all_variants_chr'+str(chrom)+'.csv.gz', compression = 'gzip', usecols = ['POS', 'REF', 'ALT', 'AC', 'AN', 'AS_VQSLOD', 'InbreedingCoeff'])
    print('loaded chr' + str(chrom) + '    ', end="\r", flush=True)

    # Select only rare variants
    gnomad_current = gnomad_current.loc[gnomad_current['AC'] / gnomad_current['AN'] <  10**-4].copy()

    # SNVs only
    gnomad_current['ref_len'] = [len(ref) for ref in gnomad_current['REF']]
    gnomad_current['alt_len'] = [len(alt) for alt in gnomad_current['ALT']]
    gnomad_current = gnomad_current.loc[(gnomad_current['ref_len'] == 1) & (gnomad_current['alt_len'] == 1)].copy()
    del gnomad_current['ref_len']; del gnomad_current['alt_len']

    # How much to downsample:
    for sample_frac in [0.1, 0.01, 0.004]:

        # Random downsampling
        current_sample = dict()
        for AC in gnomad_current['AC'].value_counts().index:
            current_sample[AC] = gnomad_current.loc[gnomad_current['AC'] == AC].sample(frac = min(sample_frac * AC, 1), random_state = 0)
        current_sample = pd.concat(current_sample).reset_index(drop = True)

        # One line per position, keeping AC and AS_VQSLOD quality scores per mutation type
        gnomad_byalt = dict()
        for alt in ['A', 'T', 'G', 'C']:
            gnomad_byalt[alt] = current_sample.loc[current_sample['ALT'] == alt].copy()
        gnomad_slim = pd.concat([gnomad_byalt[alt][['POS', 'AC']].set_index('POS') for alt in ['A', 'T', 'G', 'C']], axis=1).fillna(0).astype(int)
        gnomad_slim.columns = ['A', 'T', 'G', 'C']
        gnomad_slim = gnomad_slim.sort_index()
        gnomad_slim['Tri'] = [tri_function(chrom,pos) for pos in gnomad_slim.index]
        for alt in ['A', 'T', 'G', 'C']:
            gnomad_slim['qual_'+alt] = gnomad_byalt[alt][['POS', 'AS_VQSLOD']].set_index('POS')
            gnomad_slim['inbr_'+alt] = gnomad_byalt[alt][['POS', 'InbreedingCoeff']].set_index('POS')

        nmer = 51
        current_gc = pd.DataFrame(list(reference_genome[chrom]), columns = ['seq'])
        current_gc['GC'] = current_gc['seq'].isin(['G', 'C'])
        current_gc['N'] = current_gc['seq'] == 'N'
        current_gc['GC_'+str(nmer)] = current_gc['GC'].rolling(nmer, center = True).mean()
        current_gc['N_'+str(nmer)] = current_gc['N'].rolling(nmer, center = True).mean()
        chrom_nucleotide_content = pd.DataFrame(current_gc.loc[current_gc['N_'+str(nmer)] == False]['GC_'+str(nmer)]).round(3)
        chrom_nucleotide_content.index = chrom_nucleotide_content.index +1
        gnomad_slim['GC_'+str(nmer)] = chrom_nucleotide_content['GC_'+str(nmer)].reindex(gnomad_slim.index)
        chrom_nucleotide_content['Tri'] = [tri_function(chrom, pos) for pos in chrom_nucleotide_content.index]
        chrom_nucleotide_content['count'] = 1
            
            
        print('saving chr' + str(chrom) + ' sample_frac ' + str(sample_frac) + '        ', end="\r", flush=True)
        gnomad_slim.to_csv('./gnomad/gnomad31_slim_downsample_'+str(sample_frac)+'_chr'+str(chrom)+'_AC-tri-qual.csv.gz', compression = 'gzip')
          
    print('finished chr' + str(chrom) + '    ', end="\r", flush=True)

for chrom in range(chr_range, 23):
    downsample_gnomad_files(chrom)
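
The AC-weighted downsampling inside `downsample_gnomad_files` can be sketched on a small made-up table. The variants, `toy_variants`, and the `sample_frac` value here are illustrative assumptions, not values from the analysis.

```python
import pandas as pd

# Made-up variants: position and allele count (AC)
toy_variants = pd.DataFrame({'POS': range(1, 21), 'AC': [1] * 10 + [2] * 6 + [5] * 4})
sample_frac = 0.2

toy_sample = dict()
for AC in toy_variants['AC'].unique():
    # Variants with allele count AC are retained at a fraction of sample_frac * AC, capped at 1
    keep_frac = min(sample_frac * AC, 1)
    toy_sample[AC] = toy_variants.loc[toy_variants['AC'] == AC].sample(frac=keep_frac, random_state=0)
toy_sample = pd.concat(toy_sample).reset_index(drop=True)
```

With these toy values, every variant with AC = 5 is retained, while only a subset of the singletons is kept.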

#### Correct allele count to better reflect number of independent mutations <a name="GNOMAD_AC_correction"></a>
- AC > 1 can reflect either shared ancestry or independent mutations
- The correction here is a simplified version of the procedure detailed in Seplyarskiy et al., 2021, Science (see Supplemental Methods)
- The numbers included here reflect a model using a sample size of 75,000 and u of 1x10^-7 (~50x higher than the background mutation rate)
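
The correction is applied as a table lookup: each observed allele count is replaced by the corresponding entry of the correction series. A minimal sketch of that lookup, using a shortened, made-up table (`toy_correction`) and made-up positions rather than the values used in the analysis:

```python
import pandas as pd

# Made-up correction table, indexed by observed allele count
toy_correction = pd.Series([0.0, 1.0, 1.4, 1.6], index=[0, 1, 2, 3])

# Observed allele counts at four positions, indexed by genomic position
toy_AC = pd.Series([0, 1, 3, 2], index=[101, 205, 310, 422])

# reindex looks up each allele count in the table;
# set_axis then restores the original position index
toy_corrected = toy_correction.reindex(toy_AC).set_axis(toy_AC.index)
```

An allele count that is absent from the table would come back as NaN, so the table has to cover every allele count that survives the rare-variant filter.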

[Return to Table of Contents](#TOC)

In [None]:
AC_correction = pd.Series([0, 1, 1.426543, 1.608968, 1.698695, 1.749019, 1.779155, 1.797395, 1.808054, 1.813768, 1.816325, 1.816938, 1.816359, 1.815024, 1.813194, 1.811021, 1.808583], index = list(range(17)))

for chrom in range(chr_range,23):
    for base in ['A', 'T', 'G', 'C']:
        gnomad_slim_all[chrom][base] = AC_correction.reindex(gnomad_slim_all[chrom][base]).set_axis(gnomad_slim_all[chrom].index)
    print('finished chr' + str(chrom) + '      ', end="\r", flush=True)

#### Count total mutations in gnomAD database, using allele count <a name="GNOMAD_freq_AC"></a>
[Return to Table of Contents](#TOC)

In [None]:
# Generate counts
genome_AC_totals = dict()
for qc_cutoff in qc_cutoff_list:
    genome_AC_totals[qc_cutoff[0]] = dict()
    for chrom in range(chr_range,23):
        genome_AC_totals[qc_cutoff[0]][chrom] = (gnomad_slim_all[chrom].set_index('Tri')[['A', 'T', 'G', 'C']]).mul((gnomad_slim_all[chrom][['qual_A', 'qual_T', 'qual_G', 'qual_C']] > qc_cutoff[0]).astype(int).values, axis=0).mul((gnomad_slim_all[chrom][['inbr_A', 'inbr_T', 'inbr_G', 'inbr_C']] > qc_cutoff[1]).astype(int).values, axis=0).reset_index().groupby(['Tri']).sum().reset_index()
        print('finished qc' + str(qc_cutoff[0]) + ' chr' + str(chrom) + '         ', end="\r", flush=True)
    genome_AC_totals[qc_cutoff[0]] = pd.concat(genome_AC_totals[qc_cutoff[0]]).groupby(['Tri']).sum()
    print('finished qc' + str(qc_cutoff[0]) + '             ', end="\r", flush=True)

In [None]:
# Format counts
genome_AC_totals_format = dict()
for qc_cutoff in qc_cutoff_list:
    genome_AC_totals_format[qc_cutoff[0]] = genome_AC_totals[qc_cutoff[0]].reindex(all_triplets).stack()
    genome_AC_totals_format[qc_cutoff[0]].index = [tri+'_'+mut for tri, mut in zip(genome_AC_totals_format[qc_cutoff[0]].index.get_level_values(0), genome_AC_totals_format[qc_cutoff[0]].index.get_level_values(1))]
    genome_AC_totals_format[qc_cutoff[0]] = genome_AC_totals_format[qc_cutoff[0]].reindex(triplet_mutations_und)
genome_AC_totals_format = pd.concat(genome_AC_totals_format, axis=1)
genome_AC_totals_RC = triplet_combine_RC(genome_AC_totals_format, mut_input = True)

# Save
genome_AC_totals_RC.to_csv('./gnomad/gnomad31_triplet_ACcorrected_counts_chr'+str(chr_range)+'-22.csv')

In [None]:
# Load
genome_AC_totals_RC = pd.read_csv('./gnomad/gnomad31_triplet_ACcorrected_counts_chr'+str(chr_range)+'-22.csv', index_col = 0)
genome_AC_totals_RC.columns = [qc_cutoff[0] for qc_cutoff in qc_cutoff_list]
genome_triplet_totals = pd.read_csv('./hg38/triplet_totals_hg38_chr'+str(chr_range)+'-22.csv', index_col = 0)

# Calculate frequency
triplet_totals_RC_mut = triplet_combine_RC(genome_triplet_totals['hg38'], mut_output = True)
genome_AC_freq_RC = genome_AC_totals_RC.div(triplet_totals_RC_mut, axis=0)
genome_AC_freq_all = triplet_combine_RC(genome_AC_freq_RC, mut_input=True, decombine=True)

#### Ts/Tv ratio <a name="GNOMAD_freq_tstv"></a>
[Return to Table of Contents](#TOC)

In [None]:
triplet_mutations_und_Ts = [base1+base2+base3+'_'+mutbase for base1 in ['A', 'T', 'G', 'C'] for base2 in ['T', 'C'] for base3 in ['A', 'T', 'G', 'C'] for mutbase in ['T', 'C'] if mutbase != base2]
triplet_mutations_und_Tv = [mut for mut in triplet_mutations_und_TC if mut not in triplet_mutations_und_Ts]

ts_tv = pd.DataFrame()
ts_tv['AC'] = pd.Series([genome_AC_totals_RC[qc_cutoff[0]].loc[triplet_mutations_und_Ts].sum() / genome_AC_totals_RC[qc_cutoff[0]].loc[triplet_mutations_und_Tv].sum() for qc_cutoff in qc_cutoff_list], index = [qc_cutoff[0] for qc_cutoff in qc_cutoff_list])
ts_tv
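
As a sketch of the classification used above: when every mutation is written with a pyrimidine (T or C) as the central base, transitions are the T<->C changes and all remaining changes are transversions. The list names below are hypothetical, and the sketch assumes the `triplet_newbase` naming used elsewhere in this notebook.

```python
bases = ['A', 'T', 'G', 'C']

# All mutations written as triplet_newbase, with a pyrimidine central base
all_mutations = [b1 + b2 + b3 + '_' + mut
                 for b1 in bases for b2 in ['T', 'C'] for b3 in bases
                 for mut in bases if mut != b2]

# Transitions keep the base class: T<->C for a pyrimidine center
transitions = [m for m in all_mutations if m[-1] in ['T', 'C']]
# Everything else (pyrimidine -> purine) is a transversion
transversions = [m for m in all_mutations if m not in transitions]
```

Each triplet therefore has one possible transition and two possible transversions.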

### Genome nucleotide content (GC%) correction <a name="GNOMAD_GC_correction"></a>
- generate mutation rate correction factor based on GC%
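
Conceptually, the correction factor for a GC bin is the mutation frequency within that bin divided by the genome-wide frequency for the same mutation type. A minimal sketch with made-up counts for a single mutation type (all names and numbers are illustrative assumptions, and the smoothing and averaging steps used below are omitted):

```python
import pandas as pd

gc_bins = [0.3, 0.5, 0.7]
# Made-up counts for one mutation type, split by local GC fraction
toy_mutations = pd.Series([30, 50, 60], index=gc_bins)   # observed mutations per bin
toy_sites = pd.Series([1000, 1000, 500], index=gc_bins)  # mutable sites per bin

# Mutation frequency within each GC bin
toy_freq_by_gc = toy_mutations / toy_sites
# Genome-wide frequency, ignoring GC content
toy_freq_overall = toy_mutations.sum() / toy_sites.sum()

# Correction factor: > 1 means the bin mutates more often than the genome-wide average
toy_gc_factor = toy_freq_by_gc / toy_freq_overall
```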

#### Calculate GC content along the genome with a rolling window
[Return to Table of Contents](#TOC)

In [None]:
current_mut_gc = dict()
nmer = 51
for qc_cutoff in qc_cutoff_list:
    current_mut_gc[qc_cutoff[0]] = dict()
    for chrom in range(chr_range,23):
        current_mut_gc[qc_cutoff[0]][chrom] = pd.concat([gnomad_slim_all[chrom][['Tri', 'GC_'+str(nmer)]], (gnomad_slim_all[chrom][['A', 'T', 'G', 'C']]).mul((gnomad_slim_all[chrom][['qual_A', 'qual_T', 'qual_G', 'qual_C']] > qc_cutoff[0]).astype(int).values, axis=0).mul((gnomad_slim_all[chrom][['inbr_A', 'inbr_T', 'inbr_G', 'inbr_C']] > qc_cutoff[1]).astype(int).values, axis=0)], axis=1).groupby(['GC_'+str(nmer), 'Tri']).sum()
        print('\r' + str(chrom), end='        ')
    current_mut_gc[qc_cutoff[0]] = pd.concat(current_mut_gc[qc_cutoff[0]])
    current_mut_gc[qc_cutoff[0]] = current_mut_gc[qc_cutoff[0]].groupby(['GC_'+str(nmer), 'Tri']).sum().unstack().transpose()
    current_mut_gc[qc_cutoff[0]].index = current_mut_gc[qc_cutoff[0]].index.get_level_values('Tri') + '_' + current_mut_gc[qc_cutoff[0]].index.get_level_values(0)
    current_mut_gc[qc_cutoff[0]] = triplet_combine_RC(current_mut_gc[qc_cutoff[0]], mut_input=True).fillna(0)

In [None]:
current_mut_gc = pd.concat(current_mut_gc)

In [None]:
current_mut_gc.to_csv('./gnomad/mut_counts_ACcorrected_by_GCwindow_nmer51_chr'+str(chr_range)+'-22.csv')

#### Load output

In [None]:
combined_nucleotide_count = pd.read_csv('./hg38/hg38_triplet_counts_by_GCwindow_nmer51_chr'+str(chr_range)+'-22.csv', index_col = 0)
current_mut_gc = pd.read_csv('./gnomad/mut_counts_ACcorrected_by_GCwindow_nmer51_chr'+str(chr_range)+'-22.csv', index_col = [0,1])
combined_nucleotide_count.columns = combined_nucleotide_count.columns.astype(float)
current_mut_gc.columns = current_mut_gc.columns.astype(float)

#### Calculate correction factors

In [None]:
current_freq_gc = dict(); current_norm_gc = dict()
for qc_cutoff in qc_cutoff_list:
    current_freq_gc[qc_cutoff[0]] = current_mut_gc.loc[qc_cutoff[0]].dropna(how = 'all', axis=1).div(combined_nucleotide_count.dropna(how = 'all', axis=1), axis=0)
    current_norm_gc[qc_cutoff[0]] = current_freq_gc[qc_cutoff[0]].div(genome_AC_freq_RC[qc_cutoff[0]], axis=0)

In [None]:
# Smooth the per-trinucleotide correction factors across GC bins with a centered rolling mean
gc_correction_bytri = dict()
for qc_cutoff in qc_cutoff_list:
    gc_correction_bytri[qc_cutoff[0]] = triplet_combine_RC(current_norm_gc[qc_cutoff[0]], mut_input = True, decombine = True).rolling(max(3, round(len(current_norm_gc[qc_cutoff[0]].columns)/10)), center = True, min_periods = 1, axis=1).mean().sort_index(axis=1).transpose()
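The smoothing step above uses a centered window of `max(3, round(n_bins/10))` bins with `min_periods=1`, so the first and last GC bins are averaged over fewer neighbours instead of becoming NaN. A minimal sketch on a single toy Series (the values and the name `toy` are invented for illustration; the notebook applies the same idea across the columns of a DataFrame):

```python
import pandas as pd

# Hypothetical correction factors for one trinucleotide, indexed by GC fraction (toy values)
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=[0.30, 0.35, 0.40, 0.45, 0.50])

# Same window rule as the cell above: one tenth of the bins, but never fewer than 3
window = max(3, round(len(toy) / 10))

# Centered rolling mean; min_periods=1 keeps the edge bins
smoothed = toy.rolling(window, center=True, min_periods=1).mean()
print(smoothed.tolist())
```

With five bins the window is 3, so interior bins are averaged with one neighbour on each side and the two edge bins with their single neighbour.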

In [None]:
gc_correction_bymut = dict()
for qc_cutoff in qc_cutoff_list:
    gc_correction_bymut[qc_cutoff[0]] = dict()
    for mut in triplets_by_mut.index:
        gc_correction_bymut[qc_cutoff[0]][mut] = pd.Series(np.ma.average(np.ma.MaskedArray(gc_correction_bytri[qc_cutoff[0]].transpose().loc[triplets_by_mut[mut]], mask=np.isnan(gc_correction_bytri[-np.inf].transpose().loc[triplets_by_mut[mut]])), weights=current_mut_gc.loc[qc_cutoff[0]].loc[triplets_by_mut[mut]].reindex(gc_correction_bytri[qc_cutoff[0]].index, axis=1), axis=0), index = gc_correction_bytri[qc_cutoff[0]].index)
    gc_correction_bymut[qc_cutoff[0]] = pd.concat(gc_correction_bymut[qc_cutoff[0]], axis=1).interpolate(axis=0, limit_direction = 'both')

In [None]:
gc_correction_bymut_all = dict()
for qc_cutoff in qc_cutoff_list:
    gc_correction_bymut_all[qc_cutoff[0]] = dict()
    for mut in triplets_by_mut.index:
        gc_correction_bymut_all[qc_cutoff[0]][mut] = gc_correction_bymut[qc_cutoff[0]][[mut]*len(triplets_by_mut[mut])]
        gc_correction_bymut_all[qc_cutoff[0]][mut].columns = triplets_by_mut[mut]
    gc_correction_bymut_all[qc_cutoff[0]] = pd.concat(gc_correction_bymut_all[qc_cutoff[0]], axis=1)
    gc_correction_bymut_all[qc_cutoff[0]].columns = gc_correction_bymut_all[qc_cutoff[0]].columns.get_level_values(1)
    gc_correction_bymut_all[qc_cutoff[0]] = triplet_combine_RC(gc_correction_bymut_all[qc_cutoff[0]].transpose(), mut_input=True, decombine=True).transpose()
    gc_correction_bymut_all[qc_cutoff[0]].index.name = 'GC_51'

#### Plot correction factors

In [None]:
genome_gc_figS8a = make_subplots(rows = 2, cols = 2, subplot_titles = ['no QC', 'pass', 'VQSLOD > 0', 'VQSLOD > 4'], shared_yaxes=True, shared_xaxes = True, vertical_spacing = 0.1, horizontal_spacing = 0.05)

for qc_cutoff, rowval, colval in zip (qc_cutoff_list, [1,1,2,2], [1,2,1,2]):
    for mut in triplets_by_mut.index:
        if mut != 'CpG_T':
            genome_gc_figS8a.add_trace(go.Scattergl(x = gc_correction_bymut[qc_cutoff[0]].index, y = gc_correction_bymut[qc_cutoff[0]][mut], name = mut, legendgroup = mut, showlegend = False if max(rowval, colval) > 1 else True, opacity = 0.7, line = dict(width = 5), marker = dict(color = colors['color'][mut])), row = rowval, col = colval)
        else:
            genome_gc_figS8a.add_trace(go.Scattergl(x = gc_correction_bymut[qc_cutoff[0]].index, y = gc_correction_bymut[qc_cutoff[0]][mut], name = mut, legendgroup = mut + 'CpG',  showlegend = False if max(rowval, colval) > 1 else True, opacity = 0.7, line = dict(dash = 'dash', width = 5), marker = dict(color = colors['color']['C_T'])), row = rowval, col = colval)

    for group in colors.index:
        for triplet in colors['ind'][group]:
            if triplet[1:3] != 'CG':
                genome_gc_figS8a.add_trace(go.Scattergl(x = gc_correction_bytri[qc_cutoff[0]].index, y = gc_correction_bytri[qc_cutoff[0]][triplet].replace(0, np.nan), name = triplet, legendgroup = group, showlegend = False, opacity = 0.1, marker = dict(color = colors['color'][group])), row = rowval, col = colval)
            else:
                genome_gc_figS8a.add_trace(go.Scattergl(x = gc_correction_bytri[qc_cutoff[0]].index, y = gc_correction_bytri[qc_cutoff[0]][triplet].replace(0, np.nan), name = triplet, legendgroup = group + 'CpG', showlegend = False, opacity = 0.1, line = dict(dash = 'dash'), marker = dict(color = colors['color'][group])), row = rowval, col = colval)
            
            
genome_gc_figS8a.add_annotation(text = 'correction factor', showarrow = False, textangle = 270, yref = 'paper', y = 0.5, xref = 'paper', x = -0.075, font = dict(size = 18))
genome_gc_figS8a.add_annotation(text = 'GC %', showarrow = False, yref = 'paper', y = -0.1, xref = 'paper', x = 0.5, font = dict(size = 18))
            
genome_gc_figS8a.update_yaxes(range = [-0.5,16.5])
genome_gc_figS8a.update_layout(height = 500, width = 800, margin = dict(l = 55, r = 5, b = 50, t = 30), legend = dict(font = dict(size = 14)))

genome_gc_figS8a.show()

In [None]:
genome_gc_figS8a.write_image('./plots/revision_gc_correction_fig_S8a.png', format='png', scale = 10, engine = 'orca')

### Downsampling gnomAD files <a name="GNOMAD_downsample"></a>
- downsampling adjusted for allele count

[Return to Table of Contents](#TOC)

#### Load downsampled gnomAD SNV database and calculate trinucleotide mutation frequency <a name="GNOMAD_freq"></a>

In [None]:
# Load SNV database
gnomad_slim_downsample = dict()
for sample_frac in [0.1, 0.01, 0.004]:
    gnomad_slim_downsample[sample_frac] = dict()
    for chrom in range(chr_range, 23):
        gnomad_slim_downsample[sample_frac][chrom] = pd.read_csv('./gnomad/gnomad31_slim_downsample_'+str(sample_frac)+'_chr'+str(chrom)+'_AC-tri-qual.csv.gz', compression = 'gzip')
        gnomad_slim_downsample[sample_frac][chrom].set_index('POS', inplace = True)
        print('loaded chr' + str(chrom) + '      ', end="\r", flush=True)

#### Correct allele count to better reflect number of independent mutations <a name="GNOMAD_downsample_AC_correction"></a>
- The correction factors below come from a model with a sample size of 7,500 or 1,000 and a mutation rate u of 1x10^-7 (~50x higher than the background mutation rate)

[Return to Table of Contents](#TOC)

In [None]:
AC_correction_downsample = dict()
AC_correction_downsample[0.1] = pd.Series([0, 1, 1.222990, 1.244288, 1.235160, 1.225334, 1.218234, 1.213315, 1.209864, 1.207440, 1.205780, 1.204691, 1.203996, 1.203544, 1.203226, 1.202975, 1.202767], index = list(range(17)))
AC_correction_downsample[0.01] = pd.Series([0, 1, 1.076811, 1.068972, 1.067307, 1.067242, 1.067137, 1.066722, 1.066099, 1.065402, 1.064703, 1.064018, 1.063345, 1.062688, 1.062059, 1.061476, 1.060957], index = list(range(17)))
AC_correction_downsample[0.004] = pd.Series([0, 1, 1.076811, 1.068972, 1.067307, 1.067242, 1.067137, 1.066722, 1.066099, 1.065402, 1.064703, 1.064018, 1.063345, 1.062688, 1.062059, 1.061476, 1.060957], index = list(range(17)))

In [None]:
for sample_frac in [0.1, 0.01, 0.004]:
    for chrom in range(chr_range,23):
        for base in ['A', 'T', 'G', 'C']:
            gnomad_slim_downsample[sample_frac][chrom][base] = AC_correction_downsample[sample_frac].reindex(gnomad_slim_downsample[sample_frac][chrom][base]).set_axis(gnomad_slim_downsample[sample_frac][chrom].index)
        print('finished chr' + str(chrom) + '      ', end="\r", flush=True)
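The correction above is a table lookup: the correction Series is indexed by observed allele count, `reindex` pulls one corrected value per variant, and `set_axis` restores the position index. A minimal sketch with toy numbers (the table values, positions and variable names here are invented, not the model output used in the notebook):

```python
import pandas as pd

# Hypothetical correction table: index = observed allele count, value = corrected count
correction = pd.Series([0, 1, 1.2, 1.25], index=[0, 1, 2, 3])

# Toy allele counts for one alternate base, indexed by genomic position
allele_count = pd.Series([0, 2, 1, 3, 2], index=[101, 102, 103, 104, 105])

# reindex() looks each allele count up in the table; set_axis() puts the positions back
corrected = correction.reindex(allele_count).set_axis(allele_count.index)
print(corrected)
```

Allele counts that are not present in the table come back as NaN, which is why the table has to cover every count that can occur in the input.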

#### Count total mutations in downsampled gnomAD database, using allele count

In [None]:
# Generate counts
genome_AC_totals_downsample = dict()
genome_AC_totals_downsample_format = dict()
genome_AC_totals_RC_downsample = dict()

for sample_frac in [0.1, 0.01, 0.004]:
    genome_AC_totals_downsample[sample_frac] = dict()
    for qc_cutoff in qc_cutoff_list:
        genome_AC_totals_downsample[sample_frac][qc_cutoff[0]] = dict()
        for chrom in range(chr_range,23):
            genome_AC_totals_downsample[sample_frac][qc_cutoff[0]][chrom] = (gnomad_slim_downsample[sample_frac][chrom].set_index('Tri')[['A', 'T', 'G', 'C']]).mul((gnomad_slim_downsample[sample_frac][chrom][['qual_A', 'qual_T', 'qual_G', 'qual_C']] > qc_cutoff[0]).astype(int).values, axis=0).mul((gnomad_slim_downsample[sample_frac][chrom][['inbr_A', 'inbr_T', 'inbr_G', 'inbr_C']] > qc_cutoff[1]).astype(int).values, axis=0).reset_index().groupby(['Tri']).sum().reset_index()
            print('finished qc' + str(qc_cutoff[0]) + ' chr' + str(chrom) + '         ', end="\r", flush=True)
        genome_AC_totals_downsample[sample_frac][qc_cutoff[0]] = pd.concat(genome_AC_totals_downsample[sample_frac][qc_cutoff[0]]).groupby(['Tri']).sum()
        print('finished qc' + str(qc_cutoff[0]) + '             ', end="\r", flush=True)

    # Format counts
    genome_AC_totals_downsample_format[sample_frac] = dict()
    for qc_cutoff in qc_cutoff_list:
        genome_AC_totals_downsample_format[sample_frac][qc_cutoff[0]] = genome_AC_totals_downsample[sample_frac][qc_cutoff[0]].reindex(all_triplets).stack()
        genome_AC_totals_downsample_format[sample_frac][qc_cutoff[0]].index = [tri+'_'+mut for tri, mut in zip(genome_AC_totals_downsample_format[sample_frac][qc_cutoff[0]].index.get_level_values(0), genome_AC_totals_downsample_format[sample_frac][qc_cutoff[0]].index.get_level_values(1))]
        genome_AC_totals_downsample_format[sample_frac][qc_cutoff[0]] = genome_AC_totals_downsample_format[sample_frac][qc_cutoff[0]].reindex(triplet_mutations_und)
    genome_AC_totals_downsample_format[sample_frac] = pd.concat(genome_AC_totals_downsample_format[sample_frac], axis=1)
    genome_AC_totals_RC_downsample[sample_frac] = triplet_combine_RC(genome_AC_totals_downsample_format[sample_frac], mut_input = True)

    # Save
    genome_AC_totals_RC_downsample[sample_frac].to_csv('./gnomad/gnomad31_triplet_ACcorrected_counts_downsample_'+str(sample_frac)+'_chr'+str(chr_range)+'-22.csv')
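In the counting cell above, variants that fail a QC cutoff are not dropped; their counts are multiplied by 0 so that the table keeps its shape before grouping by trinucleotide. A minimal sketch of that masking pattern with two alternate bases and a single quality cutoff (the toy table and the name `cutoff` are assumptions for illustration):

```python
import pandas as pd

# Toy table: corrected counts (A, T) and per-allele quality scores (qual_A, qual_T)
toy = pd.DataFrame({
    'Tri': ['ACG', 'ACG', 'TTT'],
    'A': [1.0, 0.0, 2.0],
    'T': [0.0, 1.2, 0.0],
    'qual_A': [5.0, -3.0, 0.5],
    'qual_T': [1.0, 2.0, -1.0],
})
cutoff = 0  # hypothetical quality threshold

counts = toy.set_index('Tri')[['A', 'T']]
# 1 where the allele passes the cutoff, 0 otherwise; .values avoids label alignment
passed = (toy[['qual_A', 'qual_T']] > cutoff).astype(int).values
masked = counts.mul(passed, axis=0)

# Sum the surviving counts per trinucleotide
totals = masked.groupby(level='Tri').sum()
print(totals)
```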

In [None]:
# Load and calculate frequency
genome_AC_totals_RC_downsample = dict()
genome_AC_freq_RC_downsample = dict()
genome_AC_freq_all_downsample = dict()
for sample_frac in [0.1, 0.01, 0.004]:
    genome_AC_totals_RC_downsample[sample_frac] = pd.read_csv('./gnomad/gnomad31_triplet_ACcorrected_counts_downsample_'+str(sample_frac)+'_chr'+str(chr_range)+'-22.csv', index_col = 0)
    genome_AC_totals_RC_downsample[sample_frac].columns = [qc_cutoff[0] for qc_cutoff in qc_cutoff_list]
    genome_AC_freq_RC_downsample[sample_frac] = genome_AC_totals_RC_downsample[sample_frac].div(triplet_totals_RC_mut, axis=0)
    genome_AC_freq_all_downsample[sample_frac] = triplet_combine_RC(genome_AC_freq_RC_downsample[sample_frac], mut_input=True, decombine=True)

In [None]:
# Ts/Tv ratio
ts_tv_downsample = pd.DataFrame()
for sample_frac in [0.1, 0.01, 0.004]:
    ts_tv_downsample[sample_frac] = pd.Series(genome_AC_totals_RC_downsample[sample_frac].loc[triplet_mutations_und_Ts].sum().div(genome_AC_totals_RC_downsample[sample_frac].loc[triplet_mutations_und_Tv].sum()))
ts_tv_downsample
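The Ts/Tv ratio above relies on the lists `triplet_mutations_und_Ts` and `triplet_mutations_und_Tv`, which are defined elsewhere in the notebook. As a reminder of the underlying partition, here is a small standalone sketch (the helper `is_transition` and the toy counts are hypothetical, not part of the notebook):

```python
# Transitions keep the base class (purine <-> purine, pyrimidine <-> pyrimidine);
# every other substitution is a transversion
PURINES = {'A', 'G'}

def is_transition(ref, alt):
    """Return True if ref -> alt stays within purines or within pyrimidines."""
    return (ref in PURINES) == (alt in PURINES)

# Toy mutation counts keyed by (ref, alt)
toy_counts = {('C', 'T'): 60, ('A', 'G'): 40, ('C', 'A'): 20, ('C', 'G'): 15, ('A', 'T'): 15}

ts_total = sum(n for (ref, alt), n in toy_counts.items() if is_transition(ref, alt))
tv_total = sum(n for (ref, alt), n in toy_counts.items() if not is_transition(ref, alt))
ts_tv = ts_total / tv_total
print(ts_tv)
```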

## GNOMAD indels <a name="GNOMAD_indels"></a>

[Return to Table of Contents](#TOC)

#### Evaluate QC scores in GNOMAD indels, compared to SNVs

In [None]:
variants_chrom = pd.read_csv('./gnomad/all_variants_chr1.csv.gz', compression = 'gzip', usecols = ['FILTER', 'POS', 'REF', 'ALT', 'AC', 'AN', 'InbreedingCoeff', 'AS_VQSLOD'])
variants_rare = variants_chrom.loc[(variants_chrom['AN'] > variants_chrom['AN'].quantile(0.1)) & (variants_chrom['AC'] / variants_chrom['AN'] < 10**-4)].copy()
variants_indel = variants_rare.loc[(variants_rare['REF'].str.len() > 1) | (variants_rare['ALT'].str.len() > 1)].copy()
variants_snv = variants_rare.loc[(variants_rare['REF'].str.len() == 1) & (variants_rare['ALT'].str.len() == 1)].copy()

In [None]:
# Inbreeding coefficient cutoff of -0.3 is already in effect
variants_chrom.loc[variants_chrom['FILTER'] == 'AS_VQSR']['InbreedingCoeff'].min()

In [None]:
# AS_VQSLOD cutoff for SNVs is -2.774, while for indels it is -1.0607
variants_snv.loc[variants_snv['FILTER'] == 'AS_VQSR']['AS_VQSLOD'].max(), variants_indel.loc[variants_indel['FILTER'] == 'AS_VQSR']['AS_VQSLOD'].max()

In [None]:
# AS_VQSLOD cutoff of 0 for SNVs is approximately equivalent to 0 for indels (same quantile)
variants_snv['AS_VQSLOD'].quantile(0.13355), variants_indel['AS_VQSLOD'].quantile(0.13355)

In [None]:
# AS_VQSLOD cutoff of 4 for SNVs is approximately equivalent to 1.4 for indels
variants_snv['AS_VQSLOD'].quantile(0.348), variants_indel['AS_VQSLOD'].quantile(0.348)
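The two cells above match cutoffs by quantile: find the fraction of SNVs below the SNV cutoff, then read off the indel score at the same quantile. A minimal sketch of that idea as a helper (the function `equivalent_cutoff` and the toy scores are assumptions for illustration; the notebook enters the quantiles by hand):

```python
import pandas as pd

def equivalent_cutoff(ref_scores, other_scores, ref_cutoff):
    """Score in other_scores at the quantile that ref_cutoff occupies in ref_scores."""
    frac_below = (ref_scores < ref_cutoff).mean()
    return other_scores.quantile(frac_below)

# Toy score distributions standing in for SNV and indel AS_VQSLOD values
snv_scores = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
indel_scores = pd.Series(range(11), dtype=float)

# 3 of 10 toy SNV scores lie below 0.5, so the 0.3 quantile of the indel scores is used
indel_cutoff = equivalent_cutoff(snv_scores, indel_scores, 0.5)
print(indel_cutoff)
```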

In [None]:
# Complex mutations are not included...
variants_indel.loc[(variants_indel['REF'].str.len() > 1) & (variants_indel['ALT'].str.len() > 1)].copy()

#### Generate indel database <a name="GNOMAD_indels_makedb"></a>
- Load GNOMAD files, select indels

[Return to Table of Contents](#TOC)

In [None]:
for chrom in range(chr_range,23):
    # load GNOMAD file
    variants_chrom = pd.read_csv('./gnomad/all_variants_chr' + str(chrom) + '.csv.gz', compression = 'gzip', usecols = ['POS', 'REF', 'ALT', 'AC', 'AN', 'AS_VQSLOD'])
    # select rare variants: allele frequency below 10^-4, at sites where AN exceeds its 10th percentile
    variants_rare = variants_chrom.loc[(variants_chrom['AN'] > variants_chrom['AN'].quantile(0.1)) & (variants_chrom['AC'] / variants_chrom['AN'] < 10**-4)].copy()
    # select indels
    variants_indel = variants_rare.loc[(variants_rare['REF'].str.len() > 1) | (variants_rare['ALT'].str.len() > 1)].copy()
    variants_indel.to_csv('./gnomad/indels_rare_chr' + str(chrom) + '.csv.gz', compression = 'gzip', index = False)
    print('finished chr' + str(chrom) + '      ', end="\r", flush=True)

#### Load indels, format, count genome totals <a name="GNOMAD_indels_load"></a>

[Return to Table of Contents](#TOC)

In [None]:
# QC cutoff values for indels (see above)
vqslod_list_indel = [-np.inf, -1.0607, 0, 1.4]

In [None]:
variants_indel_slim_AC = dict()
variants_indel_slim_noAC = dict()
variants_indel_count_AC = dict()
variants_indel_count_noAC = dict()
for chrom in range(chr_range,23):
    # Load indels
    variants_indel = pd.read_csv('./gnomad/indels_rare_chr' + str(chrom) + '.csv.gz', compression = 'gzip', usecols = ['POS', 'REF', 'ALT', 'AC', 'AN', 'AS_VQSLOD'])
    # Center position for deletions (doesn't change position for insertions)
    variants_indel['POS_mid'] = (variants_indel['POS'] + (variants_indel['REF'].str.len()/2)).astype(int)
    variants_indel['indel'] = (variants_indel['REF'].str.len() > 1).replace(False, 'ins').replace(True, 'del')
    variants_indel['Tri_mid'] = [tri_function(chrom, pos) for pos in variants_indel['POS_mid']]

    variants_indel_slim_AC[chrom] = dict()
    variants_indel_slim_noAC[chrom] = dict()
    variants_indel_count_AC[chrom] = dict()
    variants_indel_count_noAC[chrom] = dict()
    for qc_cutoff in vqslod_list_indel:
        variants_indel['AC_cutoff'] = variants_indel['AC'].mul((variants_indel['AS_VQSLOD'] >= qc_cutoff).values, axis=0)
        variants_indel['count_cutoff'] = (variants_indel['AC'] > 0).astype(int).mul((variants_indel['AS_VQSLOD'] >= qc_cutoff).values, axis=0)
        variants_indel_slim_AC[chrom][qc_cutoff] = variants_indel.groupby(['POS_mid', 'indel']).sum()['AC_cutoff'].unstack().fillna(0).astype(int)
        variants_indel_slim_noAC[chrom][qc_cutoff] = variants_indel.groupby(['POS_mid', 'indel']).sum()['count_cutoff'].unstack().fillna(0).astype(int)

        variants_indel_count_AC[chrom][qc_cutoff] = variants_indel.groupby(['Tri_mid', 'indel']).sum()['AC_cutoff'].unstack().fillna(0).astype(int)
        variants_indel_count_noAC[chrom][qc_cutoff] = variants_indel.groupby(['Tri_mid', 'indel']).sum()['count_cutoff'].unstack().fillna(0).astype(int)
    print('finished chr' + str(chrom) + '      ', end="\r", flush=True)

for chrom in range(chr_range,23):
    variants_indel_slim_AC[chrom] = pd.concat(variants_indel_slim_AC[chrom], axis=1)
    variants_indel_slim_noAC[chrom] = pd.concat(variants_indel_slim_noAC[chrom], axis=1)
    
    variants_indel_count_AC[chrom] = pd.concat(variants_indel_count_AC[chrom], axis=1)
    variants_indel_count_noAC[chrom] = pd.concat(variants_indel_count_noAC[chrom], axis=1)
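The position and type assignment used above can be read as two small rules: the midpoint is `POS + len(REF)/2` truncated to an integer, and any variant whose REF is longer than one base is labelled a deletion. A standalone sketch of those rules (the helper names are hypothetical):

```python
def indel_midpoint(pos, ref):
    """Midpoint of the reference allele; equals pos when REF is a single base."""
    return int(pos + len(ref) / 2)

def indel_type(ref):
    """'del' when REF spans more than one base, otherwise 'ins'."""
    return 'del' if len(ref) > 1 else 'ins'

# Insertion: REF is the single anchor base, so the position is unchanged
print(indel_midpoint(100, 'A'), indel_type('A'))
# Deletion: a 5-base REF moves the position to the middle of the deleted span
print(indel_midpoint(100, 'ACGTA'), indel_type('ACGTA'))
```

Under this rule a complex variant with both REF and ALT longer than one base would also be labelled `del`.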

In [None]:
# Count total indels for each QC cutoff, with and without using allele counts
# Copy first so that the in-place += does not modify the chr22 tables
variants_indel_count_AC_sum = variants_indel_count_AC[22].copy()
for chrom in range(chr_range,22):
    variants_indel_count_AC_sum += variants_indel_count_AC[chrom]

variants_indel_count_noAC_sum = variants_indel_count_noAC[22].copy()
for chrom in range(chr_range,22):
    variants_indel_count_noAC_sum += variants_indel_count_noAC[chrom]

variants_indel_count_AC_sum_RC = triplet_combine_RC(variants_indel_count_AC_sum)
variants_indel_count_AC_freq_RC = variants_indel_count_AC_sum_RC.div(triplet_combine_RC(genome_triplet_totals['hg38']), axis=0)

variants_indel_count_AC_freq_RC = variants_indel_count_AC_freq_RC.stack()
variants_indel_count_AC_freq_RC.index = [ind[0] + '_' + ind[1] for ind in variants_indel_count_AC_freq_RC.index]
variants_indel_count_AC_freq_RC = variants_indel_count_AC_freq_RC.reindex(triplet_mutations_und_indel_TC)
variants_indel_count_AC_freq_all = triplet_combine_RC_indel(variants_indel_count_AC_freq_RC, mut_input=True, decombine=True)
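An alternative way to accumulate per-chromosome count tables, sketched here on toy data, is `DataFrame.add(..., fill_value=0)` inside `functools.reduce`. It leaves the input tables unchanged and treats a trinucleotide that is missing from one chromosome as zero. This is an illustration, not the approach used in the notebook, and the toy tables are invented:

```python
from functools import reduce
import pandas as pd

# Toy per-chromosome tables: rows = central trinucleotide, columns = indel type
per_chrom = {
    21: pd.DataFrame({'del': [1, 2], 'ins': [0, 1]}, index=['AAA', 'ACG']),
    22: pd.DataFrame({'del': [3], 'ins': [4]}, index=['ACG']),
}

# add() aligns on the index and returns a new table; fill_value=0 handles missing rows
total = reduce(lambda a, b: a.add(b, fill_value=0), per_chrom.values())
print(total)
```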

In [None]:
# Load all indels into a single object
variants_indel_all = dict()
for chrom in range(chr_range,23):
    # Load indels
    variants_indel_all[chrom] = pd.read_csv('./gnomad/indels_rare_chr' + str(chrom) + '.csv.gz', compression = 'gzip', usecols = ['POS', 'REF', 'ALT', 'AC', 'AN', 'AS_VQSLOD'])
    # Center position for deletions (doesn't change position for insertions)
    variants_indel_all[chrom]['POS_mid'] = (variants_indel_all[chrom]['POS'] + (variants_indel_all[chrom]['REF'].str.len()/2)).astype(int)
    variants_indel_all[chrom]['indel'] = (variants_indel_all[chrom]['REF'].str.len() > 1).replace(False, 'ins').replace(True, 'del')
    variants_indel_all[chrom]['Tri_mid'] = [tri_function(chrom, pos) for pos in variants_indel_all[chrom]['POS_mid']]

In [None]:
# Total number of indels in gnomAD
pd.Series([len(variants_indel_all[chrom]) for chrom in range(chr_range,23)]).sum()

In [None]:
# Reformat indel list for easier access, and count indels

# QC cutoff values for indels (see above)
vqslod_list_indel = [-np.inf, -1.0607, 0, 1.4]

variants_indel_slim_AC = dict()
variants_indel_slim_noAC = dict()
variants_indel_count_AC = dict()
variants_indel_count_noAC = dict()
for chrom in range(chr_range,23):
    variants_indel_slim_AC[chrom] = dict()
    variants_indel_slim_noAC[chrom] = dict()
    variants_indel_count_AC[chrom] = dict()
    variants_indel_count_noAC[chrom] = dict()
    for qc_cutoff in vqslod_list_indel:
        variants_indel_all[chrom]['AC_cutoff'] = variants_indel_all[chrom]['AC'].mul((variants_indel_all[chrom]['AS_VQSLOD'] >= qc_cutoff).values, axis=0)
        variants_indel_all[chrom]['count_cutoff'] = (variants_indel_all[chrom]['AC'] > 0).astype(int).mul((variants_indel_all[chrom]['AS_VQSLOD'] >= qc_cutoff).values, axis=0)
        variants_indel_slim_AC[chrom][qc_cutoff] = variants_indel_all[chrom].groupby(['POS_mid', 'indel']).sum()['AC_cutoff'].unstack().fillna(0).astype(int)
        variants_indel_slim_noAC[chrom][qc_cutoff] = variants_indel_all[chrom].groupby(['POS_mid', 'indel']).sum()['count_cutoff'].unstack().fillna(0).astype(int)

        variants_indel_count_AC[chrom][qc_cutoff] = variants_indel_all[chrom].groupby(['Tri_mid', 'indel']).sum()['AC_cutoff'].unstack().fillna(0).astype(int)
        variants_indel_count_noAC[chrom][qc_cutoff] = variants_indel_all[chrom].groupby(['Tri_mid', 'indel']).sum()['count_cutoff'].unstack().fillna(0).astype(int)
    print('finished chr' + str(chrom) + '      ', end="\r", flush=True)

for chrom in range(chr_range,23):
    variants_indel_slim_AC[chrom] = pd.concat(variants_indel_slim_AC[chrom], axis=1)
    variants_indel_slim_noAC[chrom] = pd.concat(variants_indel_slim_noAC[chrom], axis=1)
    
    variants_indel_count_AC[chrom] = pd.concat(variants_indel_count_AC[chrom], axis=1)
    variants_indel_count_noAC[chrom] = pd.concat(variants_indel_count_noAC[chrom], axis=1)

#### Insertions and deletions longer/shorter than 5 nt

In [None]:
variants_ins_long = dict(); variants_ins_short = dict()
for chrom in range(chr_range,23):
    variants_ins_short[chrom] = variants_indel_all[chrom].loc[(variants_indel_all[chrom]['ALT'].str.len() > 1) & (variants_indel_all[chrom]['ALT'].str.len() <= 5)].copy()
    variants_ins_long[chrom] = variants_indel_all[chrom].loc[variants_indel_all[chrom]['ALT'].str.len() > 5].copy()

variants_del_long = dict(); variants_del_short = dict()
for chrom in range(chr_range,23):
    variants_del_short[chrom] = variants_indel_all[chrom].loc[(variants_indel_all[chrom]['REF'].str.len() > 1) & (variants_indel_all[chrom]['REF'].str.len() <= 5)].copy()
    variants_del_long[chrom] = variants_indel_all[chrom].loc[variants_indel_all[chrom]['REF'].str.len() > 5].copy()

In [None]:
# Reformat indel list for easier access, and count indels

# QC cutoff values for indels (see above)
vqslod_list_indel = [-np.inf, -1.0607, 0, 1.4]

variants_ins_slim_AC_long = dict(); variants_ins_slim_AC_short = dict()
variants_ins_count_AC_long = dict(); variants_ins_count_AC_short = dict()
for chrom in range(chr_range,23):
    variants_ins_slim_AC_long[chrom] = dict(); variants_ins_slim_AC_short[chrom] = dict()
    variants_ins_count_AC_long[chrom] = dict(); variants_ins_count_AC_short[chrom] = dict()
    for qc_cutoff in vqslod_list_indel:
        variants_ins_long[chrom]['AC_cutoff'] = variants_ins_long[chrom]['AC'].mul((variants_ins_long[chrom]['AS_VQSLOD'] >= qc_cutoff).values, axis=0)
        variants_ins_long[chrom]['count_cutoff'] = (variants_ins_long[chrom]['AC'] > 0).astype(int).mul((variants_ins_long[chrom]['AS_VQSLOD'] >= qc_cutoff).values, axis=0)
        variants_ins_slim_AC_long[chrom][qc_cutoff] = variants_ins_long[chrom].groupby(['POS_mid', 'indel']).sum()['AC_cutoff'].unstack().fillna(0).astype(int)
        variants_ins_count_AC_long[chrom][qc_cutoff] = variants_ins_long[chrom].groupby(['Tri_mid', 'indel']).sum()['AC_cutoff'].unstack().fillna(0).astype(int)

        variants_ins_short[chrom]['AC_cutoff'] = variants_ins_short[chrom]['AC'].mul((variants_ins_short[chrom]['AS_VQSLOD'] >= qc_cutoff).values, axis=0)
        variants_ins_short[chrom]['count_cutoff'] = (variants_ins_short[chrom]['AC'] > 0).astype(int).mul((variants_ins_short[chrom]['AS_VQSLOD'] >= qc_cutoff).values, axis=0)
        variants_ins_slim_AC_short[chrom][qc_cutoff] = variants_ins_short[chrom].groupby(['POS_mid', 'indel']).sum()['AC_cutoff'].unstack().fillna(0).astype(int)
        variants_ins_count_AC_short[chrom][qc_cutoff] = variants_ins_short[chrom].groupby(['Tri_mid', 'indel']).sum()['AC_cutoff'].unstack().fillna(0).astype(int)
    print('finished chr' + str(chrom) + '      ', end="\r", flush=True)

for chrom in range(chr_range,23):
    variants_ins_slim_AC_long[chrom] = pd.concat(variants_ins_slim_AC_long[chrom], axis=1)
    variants_ins_slim_AC_short[chrom] = pd.concat(variants_ins_slim_AC_short[chrom], axis=1)
    variants_ins_count_AC_long[chrom] = pd.concat(variants_ins_count_AC_long[chrom], axis=1)
    variants_ins_count_AC_short[chrom] = pd.concat(variants_ins_count_AC_short[chrom], axis=1)

In [None]:
# Sum allele counts of long insertions across chromosomes for each QC cutoff
# (copy first so that the in-place += does not modify the chr22 table)
variants_ins_count_AC_sum_long = variants_ins_count_AC_long[22].copy()
for chrom in range(chr_range,22):
    variants_ins_count_AC_sum_long += variants_ins_count_AC_long[chrom]

variants_ins_count_AC_sum_RC_long = triplet_combine_RC(variants_ins_count_AC_sum_long)
variants_ins_count_AC_freq_RC_long = variants_ins_count_AC_sum_RC_long.div(triplet_combine_RC(genome_triplet_totals['hg38']), axis=0)

variants_ins_count_AC_freq_RC_long = variants_ins_count_AC_freq_RC_long.stack()
variants_ins_count_AC_freq_RC_long.index = [ind[0] + '_' + ind[1] for ind in variants_ins_count_AC_freq_RC_long.index]
variants_ins_count_AC_freq_RC_long = variants_ins_count_AC_freq_RC_long.reindex(triplet_mutations_und_indel_TC)
variants_ins_count_AC_freq_long = triplet_combine_RC_indel(variants_ins_count_AC_freq_RC_long, mut_input=True, decombine=True)

# Sum allele counts of short insertions across chromosomes for each QC cutoff
variants_ins_count_AC_sum_short = variants_ins_count_AC_short[22].copy()
for chrom in range(chr_range,22):
    variants_ins_count_AC_sum_short += variants_ins_count_AC_short[chrom]

variants_ins_count_AC_sum_RC_short = triplet_combine_RC(variants_ins_count_AC_sum_short)
variants_ins_count_AC_freq_RC_short = variants_ins_count_AC_sum_RC_short.div(triplet_combine_RC(genome_triplet_totals['hg38']), axis=0)

variants_ins_count_AC_freq_RC_short = variants_ins_count_AC_freq_RC_short.stack()
variants_ins_count_AC_freq_RC_short.index = [ind[0] + '_' + ind[1] for ind in variants_ins_count_AC_freq_RC_short.index]
variants_ins_count_AC_freq_RC_short = variants_ins_count_AC_freq_RC_short.reindex(triplet_mutations_und_indel_TC)
variants_ins_count_AC_freq_short = triplet_combine_RC_indel(variants_ins_count_AC_freq_RC_short, mut_input=True, decombine=True)

In [None]:
# QC cutoff values for indels (see above)
vqslod_list_indel = [-np.inf, -1.0607, 0, 1.4]

variants_del_slim_AC_long = dict(); variants_del_slim_AC_short = dict()
variants_del_count_AC_long = dict(); variants_del_count_AC_short = dict()
for chrom in range(chr_range,23):
    variants_del_slim_AC_long[chrom] = dict(); variants_del_slim_AC_short[chrom] = dict()
    variants_del_count_AC_long[chrom] = dict(); variants_del_count_AC_short[chrom] = dict()
    for qc_cutoff in vqslod_list_indel:
        variants_del_long[chrom]['AC_cutoff'] = variants_del_long[chrom]['AC'].mul((variants_del_long[chrom]['AS_VQSLOD'] >= qc_cutoff).values, axis=0)
        variants_del_long[chrom]['count_cutoff'] = (variants_del_long[chrom]['AC'] > 0).astype(int).mul((variants_del_long[chrom]['AS_VQSLOD'] >= qc_cutoff).values, axis=0)
        variants_del_slim_AC_long[chrom][qc_cutoff] = variants_del_long[chrom].groupby(['POS_mid', 'indel']).sum()['AC_cutoff'].unstack().fillna(0).astype(int)
        variants_del_count_AC_long[chrom][qc_cutoff] = variants_del_long[chrom].groupby(['Tri_mid', 'indel']).sum()['AC_cutoff'].unstack().fillna(0).astype(int)

        variants_del_short[chrom]['AC_cutoff'] = variants_del_short[chrom]['AC'].mul((variants_del_short[chrom]['AS_VQSLOD'] >= qc_cutoff).values, axis=0)
        variants_del_short[chrom]['count_cutoff'] = (variants_del_short[chrom]['AC'] > 0).astype(int).mul((variants_del_short[chrom]['AS_VQSLOD'] >= qc_cutoff).values, axis=0)
        variants_del_slim_AC_short[chrom][qc_cutoff] = variants_del_short[chrom].groupby(['POS_mid', 'indel']).sum()['AC_cutoff'].unstack().fillna(0).astype(int)
        variants_del_count_AC_short[chrom][qc_cutoff] = variants_del_short[chrom].groupby(['Tri_mid', 'indel']).sum()['AC_cutoff'].unstack().fillna(0).astype(int)
    print('finished chr' + str(chrom) + '      ', end="\r", flush=True)

for chrom in range(chr_range,23):
    variants_del_slim_AC_long[chrom] = pd.concat(variants_del_slim_AC_long[chrom], axis=1)
    variants_del_slim_AC_short[chrom] = pd.concat(variants_del_slim_AC_short[chrom], axis=1)
    variants_del_count_AC_long[chrom] = pd.concat(variants_del_count_AC_long[chrom], axis=1)
    variants_del_count_AC_short[chrom] = pd.concat(variants_del_count_AC_short[chrom], axis=1)

# Sum allele counts of long deletions across chromosomes for each QC cutoff
variants_del_count_AC_sum_long = variants_del_count_AC_long[22].copy()
for chrom in range(chr_range,22):
    variants_del_count_AC_sum_long += variants_del_count_AC_long[chrom]

variants_del_count_AC_sum_RC_long = triplet_combine_RC(variants_del_count_AC_sum_long)
variants_del_count_AC_freq_RC_long = variants_del_count_AC_sum_RC_long.div(triplet_combine_RC(genome_triplet_totals['hg38']), axis=0)

variants_del_count_AC_freq_RC_long = variants_del_count_AC_freq_RC_long.stack()
variants_del_count_AC_freq_RC_long.index = [ind[0] + '_' + ind[1] for ind in variants_del_count_AC_freq_RC_long.index]
variants_del_count_AC_freq_RC_long = variants_del_count_AC_freq_RC_long.reindex(triplet_mutations_und_indel_TC)
variants_del_count_AC_freq_long = triplet_combine_RC_indel(variants_del_count_AC_freq_RC_long, mut_input=True, decombine=True)

# Sum allele counts of short deletions across chromosomes for each QC cutoff
variants_del_count_AC_sum_short = variants_del_count_AC_short[22].copy()
for chrom in range(chr_range,22):
    variants_del_count_AC_sum_short += variants_del_count_AC_short[chrom]

variants_del_count_AC_sum_RC_short = triplet_combine_RC(variants_del_count_AC_sum_short)
variants_del_count_AC_freq_RC_short = variants_del_count_AC_sum_RC_short.div(triplet_combine_RC(genome_triplet_totals['hg38']), axis=0)

variants_del_count_AC_freq_RC_short = variants_del_count_AC_freq_RC_short.stack()
variants_del_count_AC_freq_RC_short.index = [ind[0] + '_' + ind[1] for ind in variants_del_count_AC_freq_RC_short.index]
variants_del_count_AC_freq_RC_short = variants_del_count_AC_freq_RC_short.reindex(triplet_mutations_und_indel_TC)
variants_del_count_AC_freq_short = triplet_combine_RC_indel(variants_del_count_AC_freq_RC_short, mut_input=True, decombine=True)

In [None]:
# GC correction for indels not implemented

# Prepare de novo SNV database  <a name="denovo"></a>
- Gather de novo data from all available public sources (trio/family sequencing)
- Place files in directory './denovo/download/'

[Return to Table of Contents](#TOC)

### de novo data aligned to hg19 <a name="denovo_hg19"></a>

#### data from Goldmann 2016 (hg19)
- Parent-of-origin-specific signatures of de novo mutations
- phased
- 816 trios

In [None]:
trio_gold = pd.read_excel('./denovo/download/41588_2016_BFng3597_MOESM69_ESM_hg19.xlsx', usecols = ['Chromosome', 'Start.position', 'Reference', 'Variant', 'parentOfOrigin'])
trio_gold.columns = ['chrom', 'pos', 'ref', 'alt', 'parent']

#### data from Goes et al. 2021 (hg19)
- De novo variation in bipolar disorder
- unphased
- 97 trios

In [None]:
goes2019 = pd.read_excel('./denovo/download/41380_2019_611_MOESM3_ESM.xlsx', usecols = ['chr_bp_ref_alt', 'SNV'])
goes2019[['chrom', 'pos', 'ref', 'alt']] = goes2019['chr_bp_ref_alt'].str.split('_', expand = True)
goes2019 = goes2019.dropna()[['chrom', 'pos', 'ref', 'alt']]
goes2019['chrom'] = ['chr' + str(chrom) for chrom in goes2019['chrom']]

#### data from Yuen et al. 2016 (hg19)
- Genome-wide characteristics of de novo mutations in autism
- 192 trios
- phased

In [None]:
yuen_trios = pd.read_excel('./denovo/download/41525_2016_BFnpjgenmed201627_MOESM431_ESM.xlsx', sheet_name='Table S4', skiprows = 1, usecols = ['Chromosome', 'Start', 'Reference', 'Allel', 'Parental Origin'])
yuen_trios.columns = ['chrom', 'pos', 'ref', 'alt', 'parent']

#### data from Sasani 2019 (hg19)
- Large, three-generation human families reveal post-zygotic mosaicism and variability in germline mutation accumulation
- phased
- 350 3rd generation offspring, 70 2nd generation offspring (420 genomes)

In [None]:
dnm_ceph_gen2 = pd.read_csv('./denovo/download/ceph-dnm-manuscript-master/data/second_gen.dnms.txt', sep = '\t', usecols = ['chrom', 'start', 'ref', 'alt', 'paternal_age_at_birth', 'maternal_age_at_birth', 'phase'])
dnm_ceph_gen3 = pd.read_csv('./denovo/download/ceph-dnm-manuscript-master/data/third_gen.dnms.txt', sep = '\t', usecols = ['chrom', 'start', 'ref', 'alt', 'paternal_age_at_birth', 'maternal_age_at_birth', 'phase'])
dnm_ceph_gen2_gon = pd.read_csv('./denovo/download/ceph-dnm-manuscript-master/data/gonosomal.dnms.txt', sep = '\t', usecols = ['chrom', 'start', 'ref', 'alt', 'paternal_age_at_birth', 'maternal_age_at_birth', 'phase'])
dnm_ceph_gen3_gon = pd.read_csv('./denovo/download/ceph-dnm-manuscript-master/data/post-pgcs.dnms.txt', sep = '\t', usecols = ['chrom', 'start', 'ref', 'alt', 'paternal_age_at_birth', 'maternal_age_at_birth', 'phase'])

dnm_ceph = pd.concat([dnm_ceph_gen2, dnm_ceph_gen2_gon, dnm_ceph_gen3, dnm_ceph_gen3_gon])
dnm_ceph['chrom'] = ['chr' + str(chrom) for chrom in dnm_ceph['chrom']]

dnm_ceph.columns = ['chrom', 'pos', 'ref', 'alt', 'paternal_age', 'maternal_age', 'phase']

#### Liftover hg19 data to hg38

In [None]:
trio_hg19 = pd.concat([trio_gold, goes2019, dnm_ceph, yuen_trios]).reset_index(drop = True)
trio_hg19['pos'] = trio_hg19['pos'].astype(int)

In [None]:
# Liftover to hg38
lo = LiftOver('hg19', 'hg38')
trio_hg19['hg38_lo'] = [lo.convert_coordinate(chrom, pos) for chrom,pos in zip(trio_hg19['chrom'], trio_hg19['pos'])]
trio_hg19['hg38_chr'] = [pos[0][0] if len(pos) >0 else np.nan for pos in trio_hg19['hg38_lo']]
trio_hg19['hg38_pos'] = [pos[0][1] if len(pos) >0 else np.nan for pos in trio_hg19['hg38_lo']]

trio_hg19 = trio_hg19.dropna(subset = ['hg38_chr'], axis=0)
trio_hg19 = trio_hg19.loc[trio_hg19['hg38_chr'].isin(['chr'+str(n) for n in range(1,23)])].copy()
trio_hg19['hg38_chr'] = [chrom[3:] for chrom in trio_hg19['hg38_chr']]
trio_hg19['hg38_chr'] = trio_hg19['hg38_chr'].astype(int)

trio_hg19['hg38_pos'] = trio_hg19['hg38_pos'].astype(int)
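
The liftover cell above keeps the first hit from each `convert_coordinate` result. As a rough sketch of that extraction step, the helper below (the name `first_liftover_hit` is hypothetical and not part of the notebook) pulls out the chromosome and position, and also tolerates a `None` result, which pyliftover can return for an unrecognized chromosome name. The sample inputs are made up for illustration.

```python
import math

def first_liftover_hit(result):
    """Return (chrom, pos) from the first hit, or (nan, nan) if there is none."""
    if not result:  # covers both None and an empty list
        return (math.nan, math.nan)
    return (result[0][0], result[0][1])

# Illustrative inputs shaped like convert_coordinate output:
# a list of (chrom, pos, strand, score) tuples.
print(first_liftover_hit([("chr1", 1064620, "+", 1)]))
print(first_liftover_hit([]))
print(first_liftover_hit(None))
```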

In [None]:
trio_hg19['hg38_base'] = [reference_lookup(c,p,0) for c,p in zip(trio_hg19['hg38_chr'], trio_hg19['hg38_pos'])]

trio_hg19 = trio_hg19.loc[trio_hg19['ref'] == trio_hg19['hg38_base']].copy()

trio_hg19 = trio_hg19[['hg38_chr', 'hg38_pos', 'ref', 'alt', 'parent']].copy()
trio_hg19.columns = ['chrom', 'pos', 'ref', 'alt', 'parent']

### de novo data aligned to hg38 <a name="denovo_hg38"></a>

[Return to Table of Contents](#TOC)

#### data from Halldorsson 2019 in Science (hg38)
- Characterizing mutagenic effects of recombination through a sequence-level genetic map
- phased
- 2976 trios

In [None]:
trio_set1 = pd.read_csv('./denovo/download/aau1043_DataS5_revision1.tsv', skiprows = 11, sep = '\t', low_memory=False)
trio_set1_ages = pd.read_csv('./denovo/download/aau1043_DataS7.tsv', skiprows = 4, sep = '\t')

trio_set1_ages.index = trio_set1_ages['Proband_id']
trio_set1['Father_age'] = [trio_set1_ages['Father_age'][pro] for pro in trio_set1['Proband_id']]
trio_set1['Mother_age'] = [trio_set1_ages['Mother_age'][pro] for pro in trio_set1['Proband_id']]

trio_set1['Chr'] = pd.Series([chrom[3:] for chrom in trio_set1['Chr']]).astype(int)

trio_set1 = trio_set1[['Chr', 'Pos', 'Ref', 'Alt', 'Phase_combined', 'Father_age', 'Mother_age']]
trio_set1.columns = ['chrom', 'pos', 'ref', 'alt', 'parent', 'paternal_age', 'maternal_age']

#### data from Jonsson 2017 in Nature (hg38)
- Parental influence on human germline de novo mutations in 1,548 trios from Iceland
- 1548 trios
- phased

In [None]:
trio_set2 = pd.read_csv('./denovo/download/decode_DNMs.tsv', sep = '\t', usecols = ['Chr', 'Pos_hg38', 'Ref', 'Alt', 'Discordant_in_3_gen_or_mz_twins', 'Fathers_age_at_conception', 'Mothers_age_at_conception', 'Phase_combined'])

trio_set2['Chr'] = [chrom[3:] for chrom in trio_set2['Chr']]
trio_set2 = trio_set2.loc[trio_set2['Chr'].isin([str(n) for n in range(1,23)])].copy()
trio_set2['Chr'] = trio_set2['Chr'].astype(int)

trio_set2.columns = ['chrom', 'pos', 'ref', 'alt', 'discordant', 'paternal_age', 'maternal_age', 'parent']
trio_set2 = trio_set2.loc[trio_set2['discordant'] != 'Discordant']
del trio_set2['discordant']

#### data from An 2018 in Science (hg38)
- Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder
- unphased
- 1902 quartets (3804 genomes)

In [None]:
trio_an = pd.read_excel('./denovo/download/aat6576_Table-S2_hg38_notphased.xlsx', skiprows = 1, usecols = ['Chr', 'Pos', 'Ref', 'Alt'])

trio_an['Chr'] = pd.Series([chrom[3:] for chrom in trio_an['Chr']]).astype(int)
trio_an.columns = ['chrom', 'pos', 'ref', 'alt']

#### data from Jonsson et al 2021 (hg38)
- Differences between germline genomes of monozygotic twins
- 451 offspring in quads
- 608 offspring in three-generation approach

In [None]:
# de novo mutations from quads
jonsson_quads = pd.read_csv('./denovo/download/41588_2020_755_MOESM5_ESM.tsv', sep = '\t', usecols = ['Chr', 'Pos', 'Ref', 'Alt', 'Child_Fathers_Age_at_birth', 'Child_Mothers_Age_at_birth'])
jonsson_quads.columns = ['chrom', 'pos', 'ref', 'alt', 'paternal_age', 'maternal_age']

# de novo mutations from three-generation approach
jonsson_3gen = pd.read_csv('./denovo/download/41588_2020_755_MOESM4_ESM.tsv', sep = '\t', usecols = ['Chr', 'Pos', 'Ref', 'Alt'])
jonsson_3gen.columns = ['chrom', 'pos', 'ref', 'alt']

jonsson_quads['chrom'] = [chrom[3:] for chrom in jonsson_quads['chrom']]
jonsson_quads = jonsson_quads.loc[jonsson_quads['chrom'].isin([str(n) for n in range(1,23)])].copy()
jonsson_quads['chrom'] = jonsson_quads['chrom'].astype(int)

jonsson_3gen['chrom'] = [chrom[3:] for chrom in jonsson_3gen['chrom']]
jonsson_3gen = jonsson_3gen.loc[jonsson_3gen['chrom'].isin([str(n) for n in range(1,23)])].copy()
jonsson_3gen['chrom'] = jonsson_3gen['chrom'].astype(int)

### Combine all into single de novo database <a name="denovo_combine"></a>
[Return to Table of Contents](#TOC)

In [None]:
denovo_combined = pd.concat([trio_set1, trio_set2, trio_an, jonsson_3gen, jonsson_quads, trio_hg19]).reset_index(drop = True)
denovo_combined = denovo_combined.loc[(denovo_combined['ref'].str.len() == 1) & (denovo_combined['alt'].str.len() == 1)].copy()
denovo_combined['parent'] = denovo_combined['parent'].str.lower()
denovo_combined['parent'] = denovo_combined['parent'].fillna('unassigned')
denovo_combined['tri'] = [tri_function(chrom, pos) for chrom, pos in zip(denovo_combined['chrom'], denovo_combined['pos'])]

In [None]:
# confirm that all coordinates are base1
len(denovo_combined) == len(denovo_combined.loc[[tri[1] == ref for tri, ref in zip(denovo_combined['tri'], denovo_combined['ref'])]])
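
The check above relies on the middle base of each looked-up triplet matching the reported reference allele. A toy illustration of why this detects an off-by-one coordinate convention is sketched below; `toy_tri` and the short `reference` string are invented for the example and only mimic the role of `tri_function`.

```python
reference = "ACGTTGCA"  # toy sequence; position 1 (1-based) is the first 'A'

def toy_tri(pos, base=1):
    """Triplet centred on pos, for 1-based (base=1) or 0-based (base=0) input."""
    i = pos - base  # convert to a 0-based Python index
    return reference[i - 1:i + 2]

# A variant recorded at 1-based position 3 with reference allele 'G'
pos, ref = 3, "G"
print(toy_tri(pos, base=1), toy_tri(pos, base=1)[1] == ref)  # correct convention
print(toy_tri(pos, base=0), toy_tri(pos, base=0)[1] == ref)  # wrong convention
```

With the correct convention the middle base equals the reference allele; with the wrong one the triplet is shifted by one position and the comparison fails for most sites.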

In [None]:
# Save database
denovo_combined.to_csv('./denovo/all_denovo_snvs.csv', index = False)

### Load database and calculate SNV frequencies <a name="denovo_load"></a>
[Return to Table of Contents](#TOC)

In [None]:
denovo_combined = pd.read_csv('./denovo/all_denovo_snvs.csv')

In [None]:
# Total number of genomes
denovo_n_genomes = 816 + 97 + 192 + 420 + 2976+ 1548 + 3804 + 451 + 608

# Total number of SNVs in database, SNVs per genome
len(denovo_combined), denovo_n_genomes, len(denovo_combined) / denovo_n_genomes

#### Calculate mutation frequency per trinucleotide

In [None]:
# Total number of mutations per triplet context
denovo_totals = denovo_combined.groupby(['tri', 'alt']).count()['ref']
denovo_totals.index = [tri+'_'+mut for tri, mut in zip(denovo_totals.index.get_level_values(0), denovo_totals.index.get_level_values(1))]
denovo_totals = denovo_totals.reindex(triplet_mutations_und)
denovo_totals_RC = triplet_combine_RC(denovo_totals, mut_input = True)

# Calculate frequency
denovo_freq_RC = denovo_totals_RC.div(triplet_combine_RC(genome_triplet_totals['hg38'], mut_output = True), axis=0)
denovo_freq_all = pd.DataFrame(triplet_combine_RC(denovo_freq_RC, mut_input=True, decombine=True), columns = ['denovo'])
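
As a simplified sketch of the frequency calculation above: each `NNN_N` mutation count is divided by the number of times that trinucleotide occurs in the genome. The toy data below are invented, and the sketch deliberately skips the reverse-complement merging done by `triplet_combine_RC`.

```python
import pandas as pd

# Toy de novo calls: trinucleotide context and alternate allele
calls = pd.DataFrame({
    "tri": ["ACG", "ACG", "ACG", "TTT"],
    "alt": ["T", "T", "A", "C"],
})
# Toy genome-wide occurrences of each trinucleotide
tri_totals = pd.Series({"ACG": 1000, "TTT": 4000})

# Count mutations per (context, alt) and label them NNN_N
counts = calls.groupby(["tri", "alt"]).size()
counts.index = [f"{tri}_{alt}" for tri, alt in counts.index]

# Divide each mutation count by the number of sites with that context
context = pd.Series([name.split("_")[0] for name in counts.index], index=counts.index)
freq = counts / context.map(tri_totals)
print(freq)
```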

#### Reformat mutations

In [None]:
denovo_slim = denovo_combined.groupby(['chrom', 'pos', 'tri', 'alt']).count()['ref'].unstack().fillna(0).astype(int)
denovo_slim_reformat = dict()
for chrom in range(chr_range,23):
    denovo_slim_reformat[chrom] = denovo_slim.loc[chrom].reset_index().set_index('pos')

# Calculate mutation frequency surrounding motifs <a name="mutation_surrounding"></a>

### Define counting functions <a name="mutation_surrounding_functions"></a>

[Return to Table of Contents](#TOC)

In [None]:
def count_mut_flank_chrom(chrom, current_pos_chrom, input_mut_dict, distance, left_pos_col, right_pos_col, filter_cols, filter_distance, useful_cols, qc_cutoff_list, noAC, gc_nmer, ignore_qual):
    print('\r' + str(chrom), end='        ')
    
    # filter out motifs too close to masked elements
    if len(filter_cols) > 0:
        current_pos_chrom = current_pos_chrom.loc[current_pos_chrom[[col + '_min' for col in filter_cols]].fillna(0).min(axis=1) > filter_distance].copy()
    
    # make a single dataframe consisting of all search coordinates and their position relative to the original starts/ends of interest
    
    search_positions_left = dict()
    for pos in range(-distance,0):
        search_positions_left[pos] = pd.DataFrame(current_pos_chrom[left_pos_col] + pos)
    search_positions_left = pd.concat(search_positions_left).reset_index()[[left_pos_col, 'level_0']]
    for col in useful_cols:
        search_positions_left[col] = list(current_pos_chrom[col])*distance
    if len(filter_cols) > 0:
        search_positions_left['filter_left'] = list(current_pos_chrom[[col + '_left' for col in filter_cols]].min(axis=1))*distance
        search_positions_left = search_positions_left.loc[search_positions_left['level_0'].abs() <= (search_positions_left['filter_left'] - filter_distance)].copy()
        del search_positions_left['filter_left']
    search_positions_left.columns = ['POS', 'relative_pos'] + useful_cols

    search_positions_right = dict()
    for pos in range(1,distance+1):
        search_positions_right[pos] = pd.DataFrame(current_pos_chrom[right_pos_col] + pos -1)
    search_positions_right = pd.concat(search_positions_right).reset_index()[[right_pos_col, 'level_0']]
    for col in useful_cols:
        search_positions_right[col] = list(current_pos_chrom[col])*distance
    if len(filter_cols) > 0:
        search_positions_right['filter_right'] = list(current_pos_chrom[[col + '_right' for col in filter_cols]].min(axis=1))*distance
        search_positions_right = search_positions_right.loc[search_positions_right['level_0'] <= (search_positions_right['filter_right'] - filter_distance)].copy()
        del search_positions_right['filter_right']
    search_positions_right.columns = ['POS', 'relative_pos'] + useful_cols

    for col in filter_cols:
        current_pos_chrom = current_pos_chrom.loc[current_pos_chrom[col + '_min'] >= filter_distance].copy()
    current_pos_chrom['pos'] = [list(range(left,right)) for left, right in zip(current_pos_chrom[left_pos_col], current_pos_chrom[right_pos_col])]
    for col in useful_cols:
        current_pos_chrom[col] = [[entry]*len(pos) for entry, pos in zip(current_pos_chrom[col], current_pos_chrom['pos'])]
    search_positions_middle = pd.DataFrame(flatten(current_pos_chrom['pos']), columns = ['POS'])
    search_positions_middle['relative_pos'] = 0
    for col in useful_cols:
        search_positions_middle[col] = pd.DataFrame(flatten(current_pos_chrom[col]))

    search_positions = pd.concat([search_positions_left, search_positions_right, search_positions_middle])
        
    # for each qc_cutoff in the mutation dataset, find mutations overlapping the search coordinates
    
    current_mut_count = dict()
    
    current_mut_chrom = input_mut_dict[chrom].copy()
    current_mut_chrom.index = current_mut_chrom.index - 1     # change coordinates from base1 to base0
    if noAC == True:
        current_mut_chrom[['A', 'T', 'G', 'C']] = (current_mut_chrom[['A', 'T', 'G', 'C']] > 0).astype(int)  
    
    if ignore_qual != False:
        current_mut_count[ignore_qual] = current_mut_chrom.reindex(search_positions['POS'])[['A', 'T', 'G', 'C']]
        current_mut_count[ignore_qual]['pos'] = list(search_positions['relative_pos'])
        current_mut_count[ignore_qual]['Tri'] = [tri_function(chrom, pos, base = 0) for pos in current_mut_count[ignore_qual].index]
        for col in useful_cols:
            current_mut_count[ignore_qual][col] = list(search_positions[col])
        current_mut_count[ignore_qual]['tri_count'] = 1
        current_mut_sum = dict()
        current_mut_sum[ignore_qual] = current_mut_count[ignore_qual].loc[current_mut_count[ignore_qual]['Tri'].isin(all_triplets)].groupby(['pos'] + useful_cols + ['Tri']).sum().copy()

    else:
        for qc_cutoff in qc_cutoff_list:
            print('\r' + str(chrom) + ' qc: ' + str(qc_cutoff[0]), end='        ')
            current_mut_qc = (current_mut_chrom[['A', 'T', 'G', 'C']]).mul((current_mut_chrom[['qual_A', 'qual_T', 'qual_G', 'qual_C']] >= qc_cutoff[0]).astype(int).values, axis=0).mul((current_mut_chrom[['inbr_A', 'inbr_T', 'inbr_G', 'inbr_C']] >= qc_cutoff[1]).astype(int).values, axis=0)
            current_mut_count[qc_cutoff[0]] = current_mut_qc.reindex(search_positions['POS'])[['A', 'T', 'G', 'C']]
            current_mut_count[qc_cutoff[0]]['pos'] = list(search_positions['relative_pos'])
            current_mut_count[qc_cutoff[0]]['Tri'] = [tri_function(chrom, pos, base = 0) for pos in current_mut_count[qc_cutoff[0]].index]
            if gc_nmer != False:
                current_mut_count[qc_cutoff[0]]['seq_'+str(gc_nmer)] = [reference_lookup(chrom, pos, round((gc_nmer-1)/2)) for pos in current_mut_count[qc_cutoff[0]].index]
                current_mut_count[qc_cutoff[0]]['seq_'+str(gc_nmer)] = current_mut_count[qc_cutoff[0]]['seq_'+str(gc_nmer)].astype(str)
                current_mut_count[qc_cutoff[0]]['GC_'+str(gc_nmer)] = (current_mut_count[qc_cutoff[0]]['seq_'+str(gc_nmer)].str.count('G') + current_mut_count[qc_cutoff[0]]['seq_'+str(gc_nmer)].str.count('C')) / (gc_nmer - current_mut_count[qc_cutoff[0]]['seq_'+str(gc_nmer)].str.count('N'))
            for col in useful_cols:
                current_mut_count[qc_cutoff[0]][col] = list(search_positions[col])
            current_mut_count[qc_cutoff[0]]['tri_count'] = 1   

        # count mutations at each relative position using groupby
        current_mut_sum = dict()
        for qc_cutoff in qc_cutoff_list:
            current_mut_sum[qc_cutoff[0]] = current_mut_count[qc_cutoff[0]].loc[current_mut_count[qc_cutoff[0]]['Tri'].isin(all_triplets)].groupby(['pos'] + useful_cols + ['Tri']).sum().copy()
        
    return current_mut_sum
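
The core of `count_mut_flank_chrom` is a table that pairs every genomic position near a motif with its position relative to that motif. A minimal sketch of that bookkeeping for a single motif follows. The function name `flank_positions` is hypothetical, and the 0-based, half-open `[start, end)` convention is inferred from the code above rather than stated by the authors.

```python
def flank_positions(start, end, distance):
    """Return (genomic_pos, relative_pos) pairs for one motif.

    Assumes 0-based, half-open [start, end) motif coordinates.
    """
    # left flank: relative positions -distance .. -1, counted back from start
    left = [(start + rel, rel) for rel in range(-distance, 0)]
    # right flank: relative positions 1 .. distance, counted on from end - 1
    right = [(end + rel - 1, rel) for rel in range(1, distance + 1)]
    # positions inside the motif all get relative position 0
    middle = [(pos, 0) for pos in range(start, end)]
    return left + right + middle

print(flank_positions(start=10, end=13, distance=2))
# [(8, -2), (9, -1), (13, 1), (14, 2), (10, 0), (11, 0), (12, 0)]
```

Mutation counts can then be looked up by genomic position and summed by relative position, which is what the `reindex` and `groupby` steps do in bulk.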

In [None]:
def count_mut_flank(input_pos_df, input_mut_dict = gnomad_slim_all, distance = 500, qc_cutoff_list = [(-np.inf, -np.inf), (-2.774, -0.3), (0, -0.3), (4, -0.3)], left_pos_col = 'start', right_pos_col = 'end', chrom_col = 'chrom', strand_col = 'Strand', strand_names = ('+', '-'), filter_cols = ['RM_distance', 'STR_distance', 'nonSTR_distance', 'within_motif_distance'], filter_distance = 20, useful_cols = [], noAC = False, gc_correction_dict = gc_correction_bytri, gc_nmer = False, ignore_qual = False):
   
    current_mut_sum_chrom = dict()
    for chrom in range(chr_range,23):
        current_mut_sum_chrom[chrom] = count_mut_flank_chrom(chrom, input_pos_df.loc[input_pos_df[chrom_col] == chrom].copy(), input_mut_dict, distance = distance, left_pos_col = left_pos_col, right_pos_col = right_pos_col, filter_cols = filter_cols, filter_distance = filter_distance, useful_cols = useful_cols, qc_cutoff_list = qc_cutoff_list, noAC = noAC, gc_nmer = gc_nmer, ignore_qual = ignore_qual)
    
    current_mut_sum = dict()
    if ignore_qual != False:
        current_mut_sum[ignore_qual] = pd.concat([current_mut_sum_chrom[chrom][ignore_qual] for chrom in range(chr_range,23)])
    else:
        for qc_cutoff in qc_cutoff_list:
            current_mut_sum[qc_cutoff[0]] = pd.concat([current_mut_sum_chrom[chrom][qc_cutoff[0]] for chrom in range(chr_range,23)])

    # apply reverse complement to - strand triplets, mutation counts and positions
    if strand_col in useful_cols:
        useful_cols = [col for col in useful_cols if col != strand_col]  # drop the strand column without mutating the caller's list
        current_mut_sum_strand_F = dict()
        current_mut_sum_strand_R = dict()
        current_mut_sum_strand_other = dict()
        current_mut_sum_bothstrands = dict()
        for qc_cutoff in current_mut_sum:
            current_mut_sum_strand_other[qc_cutoff] = current_mut_sum[qc_cutoff].loc[~current_mut_sum[qc_cutoff].index.get_level_values(strand_col).isin(strand_names)].copy()
            current_mut_sum_strand_other[qc_cutoff] = current_mut_sum_strand_other[qc_cutoff].reset_index().groupby(['pos'] + useful_cols + ['Tri']).sum()
            
            current_mut_sum_strand_F[qc_cutoff] = current_mut_sum[qc_cutoff].loc[current_mut_sum[qc_cutoff].index.get_level_values(strand_col) == strand_names[0]].copy()
            current_mut_sum_strand_F[qc_cutoff] = current_mut_sum_strand_F[qc_cutoff].reset_index().groupby(['pos'] + useful_cols + ['Tri']).sum()

            current_mut_sum_strand_R[qc_cutoff] = current_mut_sum[qc_cutoff].loc[current_mut_sum[qc_cutoff].index.get_level_values(strand_col) == strand_names[1]].copy()
            current_mut_sum_strand_R[qc_cutoff] = current_mut_sum_strand_R[qc_cutoff].reset_index()
            current_mut_sum_strand_R[qc_cutoff]['Tri'] = current_mut_sum_strand_R[qc_cutoff]['Tri'].apply(reverse_complement)
            current_mut_sum_strand_R[qc_cutoff]['pos'] = -current_mut_sum_strand_R[qc_cutoff]['pos']
            current_mut_sum_strand_R[qc_cutoff] = current_mut_sum_strand_R[qc_cutoff].groupby(['pos'] + useful_cols + ['Tri']).sum()
            if gc_nmer == False:
                current_mut_sum_strand_R[qc_cutoff].columns = ['T', 'A', 'C', 'G', 'tri_count']
                current_mut_sum_strand_R[qc_cutoff] = current_mut_sum_strand_R[qc_cutoff][['A', 'T', 'G', 'C', 'tri_count']]
            else:
                current_mut_sum_strand_R[qc_cutoff].columns = ['T', 'A', 'C', 'G', 'GC_'+str(gc_nmer), 'tri_count']
                current_mut_sum_strand_R[qc_cutoff] = current_mut_sum_strand_R[qc_cutoff][['A', 'T', 'G', 'C', 'GC_'+str(gc_nmer), 'tri_count']]
                

            current_mut_sum_bothstrands[qc_cutoff] = current_mut_sum_strand_F[qc_cutoff].add(current_mut_sum_strand_R[qc_cutoff], fill_value = 0).add(current_mut_sum_strand_other[qc_cutoff], fill_value = 0)
    else:
        current_mut_sum_bothstrands = dict()
        for qc_cutoff in current_mut_sum:
            current_mut_sum_bothstrands[qc_cutoff] = current_mut_sum[qc_cutoff].reset_index().groupby(['pos'] + useful_cols + ['Tri']).sum().fillna(0)
    
    if gc_nmer != False:
        for qc_cutoff in current_mut_sum:
            current_mut_sum_bothstrands[qc_cutoff]['GC_'+str(gc_nmer)] = current_mut_sum_bothstrands[qc_cutoff]['GC_'+str(gc_nmer)] / current_mut_sum_bothstrands[qc_cutoff]['tri_count']
    
    #return current_mut_sum_bothstrands

    # reformat output to NNN_N rows x pos columns, and split mut counts and trinucleotide counts
    
    current_tri_sum = current_mut_sum_bothstrands[qc_cutoff].reset_index().groupby(['pos', 'Tri']).sum().unstack().transpose().fillna(0).astype(int).loc['tri_count'].reindex(all_triplets).copy()

    current_mut_sum_reformat = dict()
    for qc_cutoff in current_mut_sum:
        current_mut_sum_reformat[qc_cutoff] = current_mut_sum_bothstrands[qc_cutoff].reset_index().groupby(['pos', 'Tri']).sum()[['A', 'T', 'G', 'C']].unstack().transpose()
        current_mut_sum_reformat[qc_cutoff].index = current_mut_sum_reformat[qc_cutoff].index.get_level_values('Tri') + '_' + current_mut_sum_reformat[qc_cutoff].index.get_level_values(0)
        current_mut_sum_reformat[qc_cutoff] = current_mut_sum_reformat[qc_cutoff].reindex(triplet_mutations_und)
        current_mut_sum_reformat[qc_cutoff].index.name = 'Mut'
        
    if gc_nmer == False:
        return current_mut_sum_reformat.copy(), current_tri_sum.copy()

    # GC window correction

    else:
        current_mut_sum_GCcorrect = dict()
        
        current_gc_by_pos = current_mut_sum_bothstrands[qc_cutoff].reset_index().groupby(['pos']).sum()['GC_'+str(gc_nmer)]
        
        for qc_cutoff in current_mut_sum:
            # reformat output
            current_mut_sum_GCcorrect[qc_cutoff] = current_mut_sum_bothstrands[qc_cutoff].reset_index().groupby(['pos', 'GC_'+str(gc_nmer), 'Tri']).sum()[['A', 'T', 'G', 'C']].unstack().transpose()
            current_mut_sum_GCcorrect[qc_cutoff].index = current_mut_sum_GCcorrect[qc_cutoff].index.get_level_values('Tri') + '_' + current_mut_sum_GCcorrect[qc_cutoff].index.get_level_values(0)
            current_mut_sum_GCcorrect[qc_cutoff] = current_mut_sum_GCcorrect[qc_cutoff].reindex(triplet_mutations_und)
            current_mut_sum_GCcorrect[qc_cutoff].index.name = 'Mut'
            # apply GC correction    
            current_mut_sum_GCcorrect[qc_cutoff] = current_mut_sum_GCcorrect[qc_cutoff].transpose().fillna(0)
            current_mut_sum_GCcorrect[qc_cutoff] = current_mut_sum_GCcorrect[qc_cutoff].mul(np.array(gc_correction_dict[qc_cutoff].iloc[np.searchsorted(gc_correction_dict[qc_cutoff].index, current_mut_sum_GCcorrect[qc_cutoff].index.get_level_values('GC_'+str(gc_nmer)))])).set_index(current_mut_sum_GCcorrect[qc_cutoff].index.get_level_values('pos'))
            current_mut_sum_GCcorrect[qc_cutoff] = current_mut_sum_GCcorrect[qc_cutoff].groupby(current_mut_sum_GCcorrect[qc_cutoff].index).sum().transpose() 
                
        return current_mut_sum_reformat.copy(), current_tri_sum.copy(), current_mut_sum_GCcorrect.copy(), current_gc_by_pos
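
The strand handling in `count_mut_flank` re-expresses minus-strand motifs in their own orientation: the relative position changes sign, the trinucleotide is reverse complemented, and the per-base mutation counts swap with their complements. A small sketch on a single record is given below; `flip_to_plus` is a hypothetical name, and `reverse_complement` here is a stand-in for the notebook's helper of the same name.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    # stand-in for the notebook's helper of the same name
    return seq.translate(COMPLEMENT)[::-1]

def flip_to_plus(record):
    """Re-express a minus-strand record in the motif's own orientation."""
    return {
        "pos": -record["pos"],
        "Tri": reverse_complement(record["Tri"]),
        # counts of mutations to each base swap with the complementary base
        "A": record["T"], "T": record["A"],
        "G": record["C"], "C": record["G"],
    }

minus_record = {"pos": -3, "Tri": "ACG", "A": 0, "T": 5, "G": 1, "C": 0}
print(flip_to_plus(minus_record))
# {'pos': 3, 'Tri': 'CGT', 'A': 5, 'T': 0, 'G': 0, 'C': 1}
```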

In [None]:
# Calculate normalized mutation frequency and binomial proportion confidence intervals
def mut_norm_conf(count_input, noAC = False, genome_AC_freq_current = genome_AC_freq_all, genome_count_freq_current = None, n_genomes = gnomad_n_genomes, min_count = 25, normtorandom = False, random_normaverage = None, tri_subset = triplet_mutations_und, do_binconf = True, output_div = False, summary_cols = False, gc_correct = False, rolling = False, window_size = 10, snvindel = 'snv'):
    mut_count_current = dict(); mut_freq = dict(); mut_div = dict(); mut_norm = dict(); mut_weights = dict(); mut_binconf_low = dict(); mut_binconf_high = dict(); mut_freq_expected = dict()

    tri_count_in = count_input[1].copy()
    if gc_correct == False:
        mut_count_in = count_input[0].copy()
    else:
        mut_count_in = count_input[2].copy()
    
    if noAC == False:
        genome_mut_frequencies = genome_AC_freq_current.reindex(tri_subset)
    else:
        genome_mut_frequencies = genome_count_freq_current.reindex(tri_subset)
        
    # Rolling windows
    if rolling == True:
        tri_count_in = tri_count_in.rolling(window_size, center=True, axis=1).sum()
    
    # Put trinucleotide counts in trinucleotide mutation format
    if snvindel == 'snv':
        tri_count_mut = tri_count_in.loc[flatten([[mut]*3 for mut in all_triplets])].copy()
        tri_count_mut.index = triplet_mutations_und
    if snvindel == 'indel':
        tri_count_mut = pd.concat([tri_count_in, tri_count_in])
        tri_count_mut.index = triplet_mutations_und_indel
    tri_count_mut = tri_count_mut.reindex(tri_subset)
    if summary_cols == True:
        tri_count_mut.columns = tri_count_mut.columns.astype(str)
    # Filter out positions with very low trinucleotide counts
    tri_count_mut_sum = tri_count_mut.sum(axis=0)
    tri_count_mut_sum = tri_count_mut_sum.reindex(mut_count_in[list(mut_count_in)[0]].columns)
    tri_count_mut = tri_count_mut[tri_count_mut_sum.loc[tri_count_mut_sum > (min_count*3)].index]
    
    for qc_cutoff in mut_count_in:
        mut_count_current[qc_cutoff] = mut_count_in[qc_cutoff].reindex(tri_subset).copy()
        if summary_cols == True:
            mut_count_current[qc_cutoff].columns = mut_count_current[qc_cutoff].columns.astype(str)
        # Rolling windows
        if rolling == True:
            mut_count_current[qc_cutoff] = mut_count_current[qc_cutoff].rolling(window_size, center=True, axis=1).sum()
        # Filter out positions with very low trinucleotide counts
        mut_count_current[qc_cutoff] = mut_count_current[qc_cutoff][tri_count_mut.columns].fillna(0)#.sort_index()
        # Calculate mutation frequencies
        mut_freq[qc_cutoff] = mut_count_current[qc_cutoff].div(tri_count_mut) / n_genomes
        mut_div[qc_cutoff] = mut_freq[qc_cutoff].div(genome_mut_frequencies[qc_cutoff], axis=0)#.reindex(tri_subset)
        # Calculate weighted average of mutation frequencies based on trinucleotide counts        
        mut_norm[qc_cutoff] =  pd.Series(np.ma.average(np.ma.MaskedArray(mut_div[qc_cutoff], mask = np.isnan(tri_count_mut.replace(0,np.nan))), weights = n_genomes * tri_count_mut.astype(np.uint64), axis=0), index = mut_div[qc_cutoff].columns)
        if normtorandom == True:
            mut_norm[qc_cutoff] = mut_norm[qc_cutoff] / random_normaverage[qc_cutoff]
        # Expected mutation frequency based on trinucleotide counts
        genome_mut_freq_reshape = np.array(list(genome_mut_frequencies[qc_cutoff])*len(tri_count_mut.columns))
        genome_mut_freq_reshape.shape = (len(tri_count_mut.columns),len(genome_mut_frequencies))
        genome_mut_freq_reshape = genome_mut_freq_reshape.transpose()
        mut_freq_expected[qc_cutoff] = pd.Series(np.ma.average(np.ma.MaskedArray(genome_mut_freq_reshape, mask = np.isnan(tri_count_mut.replace(0,np.nan))), weights = n_genomes * tri_count_mut.astype(np.uint64), axis=0), index = mut_div[qc_cutoff].columns)
        
        if do_binconf == True:
            # Calculate binomial proportion confidence intervals per trinucleotide mutation
            binconf_current = binconf(mut_count_current[qc_cutoff].sum(), (tri_count_mut.sum()) * n_genomes)
            mut_binconf_low[qc_cutoff] = pd.Series(binconf_current[0], index = mut_freq[qc_cutoff].columns)
            mut_binconf_high[qc_cutoff] = pd.Series(binconf_current[1], index = mut_freq[qc_cutoff].columns)
            # Calculate weighted average of binomial proportion confidence intervals based on trinucleotide counts
            mut_weights[qc_cutoff] = pd.Series(np.ma.average(np.ma.MaskedArray(mut_freq[qc_cutoff], mask = np.isnan(tri_count_mut.replace(0,np.nan))), weights = n_genomes * tri_count_mut.astype(np.uint64), axis=0), index = mut_freq[qc_cutoff].columns)
                
            mut_binconf_low[qc_cutoff] = mut_norm[qc_cutoff] * (mut_binconf_low[qc_cutoff] / mut_weights[qc_cutoff])
            mut_binconf_high[qc_cutoff] = mut_norm[qc_cutoff] * (mut_binconf_high[qc_cutoff] / mut_weights[qc_cutoff])

    if output_div == True:
        return mut_div.copy(), tri_count_mut.copy(), mut_count_current.copy(), mut_freq.copy(), genome_mut_frequencies.copy()
    else:        
        if do_binconf == True:
            return mut_norm.copy(), mut_binconf_low.copy(), mut_binconf_high.copy(), mut_freq_expected.copy()
        else:
            return mut_norm.copy(), mut_freq_expected.copy()
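
`mut_norm_conf` calls `binconf`, which is defined elsewhere in the notebook and whose method is not shown in this section. Purely as an illustration of a binomial proportion confidence interval, the sketch below uses the Wilson score interval. This is an assumed choice and may differ from what `binconf` actually computes.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials == 0:
        return (math.nan, math.nan)
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (centre - half, centre + half)

# e.g. 30 mutations observed across 10,000 site-by-genome opportunities
low, high = wilson_interval(successes=30, trials=10_000)
print(low, high)
```

In the function above, the analogous inputs are the summed mutation counts per position and the summed trinucleotide counts multiplied by `n_genomes`.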

## Count and analyze flanking mutations and trinucleotides <a name="mutation_surrounding_analysis"></a>
- Make new directory "./analysis/temp/" for output of mutation counts

[Return to Table of Contents](#TOC)

### Random sequences <a name="mutation_surrounding_analysis_random"></a>
- Measure mutation frequency surrounding random non-motif sequences 
- Used to normalize for the effect of uneven distribution of low quality scores 

[Return to Table of Contents](#TOC)
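
The normalization idea can be sketched on toy numbers: take the median of the random-sequence profile over a central window and divide a motif profile by it, so that a value of 1.0 means "same as random background". All values and variable names below are invented for illustration.

```python
import pandas as pd

# Toy normalized frequencies by relative position; values are made up
positions = [-2, -1, 0, 1, 2]
random_profile = pd.Series([1.10, 1.05, 1.08, 1.12, 1.07], index=positions)
motif_profile = pd.Series([1.10, 1.60, 2.14, 1.55, 1.08], index=positions)

baseline = random_profile.loc[-1:1].median()  # median over a central window
relative = motif_profile / baseline           # 1.0 means "same as random"
print(baseline)
print(relative.round(2).tolist())
```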

#### Load

In [None]:
random_seq = pd.read_csv('./custom_db/random_sequences_set1_chr1-22.csv.gz', compression = 'gzip')
random_seq_set2 = pd.read_csv('./custom_db/random_sequences_set2_chr1-22.csv.gz', compression = 'gzip')
random_seq = pd.concat([random_seq, random_seq_set2])
# assign random strand
random_seq['Strand'] = np.random.randint(0,2, size=len(random_seq))
random_seq['Strand'] = random_seq['Strand'].replace(0, '-').replace(1,'+')
random_seq = distance_within_df(random_seq, 'within_motif')
random_seq = random_seq.drop_duplicates(subset = ['chrom', 'start']).copy()
 
random_seq = random_seq.loc[random_seq[['STR_distance_min', 'nonSTR_distance_min', 'within_motif_distance_min', 'RM_distance_min']].fillna(0).min(axis=1) > 20].copy()

In [None]:
# Larger set of random positions

random_seq_larger = pd.read_csv('./custom_db/random_sequences_set3_chr1-22.csv.gz', compression = 'gzip')
random_seq_larger_set2 = pd.read_csv('./custom_db/random_sequences_set4_chr1-22.csv.gz', compression = 'gzip')
random_seq_larger = pd.concat([random_seq_larger, random_seq_larger_set2])
# assign random strand
random_seq_larger['Strand'] = np.random.randint(0,2, size=len(random_seq_larger))
random_seq_larger['Strand'] = random_seq_larger['Strand'].replace(0, '-').replace(1,'+')
random_seq_larger = distance_within_df(random_seq_larger, 'within_motif')
random_seq_larger = random_seq_larger.drop_duplicates(subset = ['chrom', 'start']).copy()

random_seq_larger = random_seq_larger.loc[random_seq_larger[['STR_distance_min', 'nonSTR_distance_min', 'within_motif_distance_min', 'RM_distance_min']].fillna(0).min(axis=1) > 20].copy()

#### Count flanking mutations and trinucleotides, then normalize

In [None]:
count_random_all = count_mut_flank(random_seq, useful_cols = ['Strand'], gc_nmer = 51)

norm_random_all = mut_norm_conf(count_random_all, gc_correct=True)

In [None]:
# Median SNV frequencies for each QC filter, used to normalize all later mutation frequencies
normtorandom_all = pd.Series([norm_random_all[0][qc_cutoff].loc[-50:50].median() for qc_cutoff in norm_random_all[0]], index = list(norm_random_all[0]))
normtorandom_all

In [None]:
norm_random_simple_freq = count_random_all[0][-np.inf].sum() / count_random_all[1].sum()
normtorandom_simplefreq = norm_random_simple_freq.loc[-50:50].median()
normtorandom_simplefreq

In [None]:
norm_random_all_normtorandom = mut_norm_conf(count_random_all, normtorandom = True, random_normaverage = normtorandom_all, gc_correct=True)

#### Plot mutation frequency surrounding random sequences
- not used in paper, only used to confirm flat line

In [None]:
vqslod_list = [-np.inf, -2.774, 0, 4]
QC_colors = make_colorscale(vqslod_list)
default_colors = make_default_colors(vqslod_list)

In [None]:
mutnorm_all_binconf_fig = go.Figure()
for qc_cutoff in vqslod_list:
    mutnorm_all_binconf_fig.add_trace(go.Scatter(name = qc_cutoff, legendgroup = qc_cutoff, showlegend=False, fill=None, x = norm_random_all[2][qc_cutoff].dropna().index, y = norm_random_all[2][qc_cutoff].dropna(), mode = 'lines', line = dict(width = 0, color = QC_colors[0][qc_cutoff])))
    mutnorm_all_binconf_fig.add_trace(go.Scatter(name = qc_cutoff, legendgroup = qc_cutoff, fill='tonexty', fillcolor = QC_colors[1][qc_cutoff], x = norm_random_all[0][qc_cutoff].dropna().index, y = norm_random_all[0][qc_cutoff].dropna(), mode = 'lines', line = dict(width = 2, color = QC_colors[0][qc_cutoff])))
    mutnorm_all_binconf_fig.add_trace(go.Scatter(name = qc_cutoff, legendgroup = qc_cutoff, showlegend=False, fill='tonexty', fillcolor = QC_colors[1][qc_cutoff], x = norm_random_all[1][qc_cutoff].dropna().index, y = norm_random_all[1][qc_cutoff].dropna(), mode = 'lines', line = dict(width = 0, color = QC_colors[0][qc_cutoff])))
mutnorm_all_binconf_fig.update_layout(title = 'random')
mutnorm_all_binconf_fig.update_xaxes(range = [-500,500])
#mutnorm_all_binconf_fig.update_yaxes(range = [0,pd.concat(norm_random_all[0], axis=1).max(axis=1).loc[-250:250].max()])
mutnorm_all_binconf_fig.update_yaxes(zeroline = False)
mutnorm_all_binconf_fig.show()

### Repeat motifs <a name="mutation_surrounding_analysis_repeat"></a>
- Measure mutation frequency surrounding repeat motif sequences 

[Return to Table of Contents](#TOC)

#### Load motif database and define categories to analyze <a name="mutation_surrounding_analysis_categories"></a>

In [None]:
# Load motif database
all_motifs_unique = pd.read_pickle('./custom_db/all_motifs_unique_chr'+str(chr_range)+'-22.pickle')

In [None]:
# STR motifs with enough power to analyze, symmetric and asymmetric STR motifs
repeats_highpower_asym = pd.Series([reverse_complement(repeat) for repeat in ['A', 'C',  'AC', 'AG', 'ACC', 'AAT', 'AAC', 'AAG', 'AGG', 'ATC', 'AGC', 'AAAT']], index = ['A', 'C',  'AC', 'AG', 'ACC', 'AAT', 'AAC', 'AAG', 'AGG', 'ATC', 'AGC', 'AAAT'])
repeats_highpower_sym = ['AT', 'GC']
repeats_highpower = list(repeats_highpower_asym.index) + list(repeats_highpower_asym) + repeats_highpower_sym

# All frames for each STR motif
def repeat_variations(current_repeat):
    fwd_repeats = [current_repeat[start:] + current_repeat[:start - len(current_repeat)] for start in range(len(current_repeat))]
    rc_repeats = [reverse_complement(repeat) for repeat in fwd_repeats]
    return fwd_repeats, rc_repeats, fwd_repeats + rc_repeats
repeats_highpower_allframes = list(set(flatten([repeat_variations(repeat)[2] for repeat in repeats_highpower])))

# Restrict STR database to highpower motifs
all_STRs_unique = all_motifs_unique.loc[(all_motifs_unique['Type'] == 'STR') & (all_motifs_unique['repeat'].isin(repeats_highpower_allframes))].dropna(axis = 1, how = 'all').copy()

# Fix annoying naming of repeat frames which were not necessarily paired reverse complements
all_STRs_unique['repeat'] = [repeat if repeat in repeats_highpower else repeat_variations(repeat)[0][1] if repeat_variations(repeat)[0][1] in repeats_highpower else repeat_variations(repeat)[0][2] if repeat_variations(repeat)[0][2] in repeats_highpower else repeat_variations(repeat)[0][3] if repeat_variations(repeat)[0][3] in repeats_highpower else np.nan for repeat in all_STRs_unique['repeat']]
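
To make the "all frames" idea concrete, the sketch below lists every rotation of a repeat unit and the reverse complement of each rotation. It restates `repeat_variations` with the rotation written as `s[i:] + s[:i]`, which gives the same strings as the slicing used above. `reverse_complement` is a stand-in for the notebook's helper.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    # stand-in for the notebook's helper of the same name
    return seq.translate(COMPLEMENT)[::-1]

def repeat_variations(current_repeat):
    n = len(current_repeat)
    # every cyclic rotation (frame) of the repeat unit
    fwd = [current_repeat[i:] + current_repeat[:i] for i in range(n)]
    # the same frames read on the opposite strand
    rc = [reverse_complement(r) for r in fwd]
    return fwd, rc, fwd + rc

fwd, rc, both = repeat_variations("AAC")
print(fwd)  # ['AAC', 'ACA', 'CAA']
print(rc)   # ['GTT', 'TGT', 'TTG']
```

All six strings describe the same tandem repeat, which is why the database entries are collapsed onto one canonical name per motif.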

In [None]:
# Remove STRs with interruptions
perfect_STRs = all_STRs_unique.loc[all_STRs_unique['status'] == 'perfect'].copy()

# Remove very short STRs
long_STRs = dict()
for repeat in repeats_highpower:
    long_STRs[repeat] = perfect_STRs.loc[(perfect_STRs['repeat'].isin(repeat_variations(repeat)[2])) & (perfect_STRs['length'] > perfect_STRs.loc[(perfect_STRs['repeat'].isin(repeat_variations(repeat)[2]))]['length'].quantile(0.8))].copy()
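
The length filter above keeps, for each motif separately, only STRs strictly longer than that motif's 80th percentile. A minimal sketch on invented toy data (the lengths and the helper name `longest_fraction` are assumptions for illustration only):

```python
import pandas as pd

# Toy table: two motifs with different length distributions
toy_STRs = pd.DataFrame({
    'repeat': ['AC'] * 5 + ['AT'] * 5,
    'length': [10, 12, 14, 16, 30, 8, 8, 8, 8, 40],
})

# Hypothetical helper: keep rows strictly above the chosen length quantile
def longest_fraction(df, quantile=0.8):
    cutoff = df['length'].quantile(quantile)
    return df.loc[df['length'] > cutoff].copy(), cutoff

long_AC, cutoff_AC = longest_fraction(toy_STRs.loc[toy_STRs['repeat'] == 'AC'])
long_AT, cutoff_AT = longest_fraction(toy_STRs.loc[toy_STRs['repeat'] == 'AT'])
print(cutoff_AC, list(long_AC['length']))
print(cutoff_AT, list(long_AT['length']))
```

Because the cutoff is computed per motif, a length that counts as "long" for one motif may be unremarkable for another; a single genome-wide cutoff would be dominated by the most abundant motifs.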

In [None]:
# Inverted repeats
all_IRs = all_motifs_unique.loc[(all_motifs_unique['Type'] == 'IR')].dropna(axis = 1, how = 'all').copy()
long_IRs = all_IRs.loc[all_IRs['stem_len'] > all_IRs['stem_len'].quantile(0.8)].copy()
very_long_IRs = all_IRs.loc[all_IRs['stem_len'] > all_IRs['stem_len'].quantile(0.95)].copy()

# Inverted repeats binned by A/T content (AT80: > 80% A/T, AT40: < 40% A/T)
AT80_IRs = long_IRs.loc[(long_IRs['Sequence'].str.count('A') + long_IRs['Sequence'].str.count('T'))  / long_IRs['length'] > 0.8].copy()
AT40_IRs = long_IRs.loc[(long_IRs['Sequence'].str.count('A') + long_IRs['Sequence'].str.count('T'))  / long_IRs['length'] < 0.4].copy()

# IRs with shorter loop length
long_IRs_loop10 = long_IRs.loc[long_IRs['spacer'] < 11]

# Mirror repeats
all_MRs = all_motifs_unique.loc[(all_motifs_unique['Type'] == 'MR')].dropna(axis = 1, how = 'all').copy()
long_MRs = all_MRs.loc[all_MRs['stem_len'] > all_MRs['stem_len'].quantile(0.8)].copy()
very_long_MRs = all_MRs.loc[all_MRs['stem_len'] > all_MRs['stem_len'].quantile(0.95)].copy()

# Mirror repeats binned by homopurine/homopyrimidine status
non_homopurine_MRs = all_MRs.loc[~(((all_MRs['seq_L'].str.count('A') + all_MRs['seq_L'].str.count('G') == 0) & (all_MRs['seq_R'].str.count('A') + all_MRs['seq_R'].str.count('G') == 0)) | ((all_MRs['seq_L'].str.count('T') + all_MRs['seq_L'].str.count('C') == 0) & (all_MRs['seq_R'].str.count('T') + all_MRs['seq_R'].str.count('C') == 0)))].copy()
homopurine_MRs = dict()
homopurine_MRs['+'] = all_MRs.loc[(all_MRs['seq_L'].str.count('A') + all_MRs['seq_L'].str.count('G') == 0) & (all_MRs['seq_R'].str.count('A') + all_MRs['seq_R'].str.count('G') == 0)].copy()
homopurine_MRs['-'] = all_MRs.loc[(all_MRs['seq_L'].str.count('T') + all_MRs['seq_L'].str.count('C') == 0) & (all_MRs['seq_R'].str.count('T') + all_MRs['seq_R'].str.count('C') == 0)].copy()
# Label each subset with its own strand
homopurine_MRs['+']['Strand'] = '+'; homopurine_MRs['-']['Strand'] = '-'
homopurine_MRs = pd.concat(homopurine_MRs); homopurine_MRs.index = homopurine_MRs.index.get_level_values(1)
perfect_homopurine_MRs = homopurine_MRs.loc[homopurine_MRs['#MM'] == 0].copy()

# MRs with shorter loop length
long_MRs_loop10 = long_MRs.loc[long_MRs['spacer'] < 11]

# Direct repeats
all_DRs = all_motifs_unique.loc[(all_motifs_unique['Type'] == 'DR')].dropna(axis = 1, how = 'all').copy()
long_DRs = all_DRs.loc[all_DRs['stem_len'] > all_DRs['stem_len'].quantile(0.8)].copy()
very_long_DRs = all_DRs.loc[all_DRs['stem_len'] > all_DRs['stem_len'].quantile(0.95)].copy()
perfect_long_DRs = long_DRs.loc[long_DRs['#MM'] == 0].copy()

# DRs with shorter loop length
long_DRs_loop10 = long_DRs.loc[long_DRs['spacer'] < 11]

# Z-DNA motifs
all_ZDNAs = all_motifs_unique.loc[(all_motifs_unique['Type'] == 'ZDNA')].dropna(axis = 1, how = 'all').copy()
long_ZDNAs = all_ZDNAs.loc[all_ZDNAs['length'] > all_ZDNAs['length'].quantile(0.8)].copy()
ZDNAs_GY = all_ZDNAs.dropna(subset = ['Strand'])

# G4 motifs
all_G4s = all_motifs_unique.loc[all_motifs_unique['Type'] == 'G4'].dropna(axis = 1, how = 'all').copy()
K_G4s = all_G4s.loc[all_G4s['status'].isin(['K+', 'both'])].copy()
PDS_G4s = all_G4s.loc[all_G4s['status'] == 'PDS'].copy()
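
The A/T binning of inverted repeats above can be illustrated on toy sequences. This is a sketch under assumptions: the sequences are invented, and the denominator is the string length rather than the database's `length` column.

```python
import pandas as pd

toy_IRs = pd.DataFrame({'Sequence': ['ATATATATAT', 'GCGCGCGCAT', 'ATGCATGCAT']})

# Fraction of A or T bases in each sequence (assumes upper-case A/C/G/T only)
at_fraction = (toy_IRs['Sequence'].str.count('A') + toy_IRs['Sequence'].str.count('T')) / toy_IRs['Sequence'].str.len()

# Same thresholds as the cell above: > 0.8 is A/T-rich, < 0.4 is G/C-rich
toy_AT80 = toy_IRs.loc[at_fraction > 0.8]
toy_AT40 = toy_IRs.loc[at_fraction < 0.4]
print(list(at_fraction))
print(list(toy_AT80['Sequence']), list(toy_AT40['Sequence']))
```

Sequences with intermediate A/T content (here the third one) fall into neither bin, so the two bins are not a partition of `long_IRs`.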

### CG repeats in/outside of CpG islands <a name="mutation_surrounding_CGI"></a>

#### Download CpG island map <a name="mutation_surrounding_CGI_download"></a>
- Download CpG island map from UCSC Table Browser: https://genome.ucsc.edu/cgi-bin/hgTables
- Select options:
    - assembly: Dec 2013 (GRCh38/hg38)
    - group: Regulation
    - track: CpG Islands
    - region: genome
    - output format: all fields from selected table
    - output filename: hg38_cpgislands.bed
    - file type returned: gzip compressed
- Place .bed.gz file in subfolder './hg38/'

#### Measure distance between CG motifs and CpG islands <a name="mutation_surrounding_CGI_distance"></a>

[Return to Table of Contents](#TOC)

In [None]:
cpgislands = pd.read_csv('./hg38/hg38_cpgislands.bed.gz', compression = 'gzip', sep = '\t', usecols = [1,2,3])
cpgislands['chrom'] = [chrom[3:] for chrom in cpgislands['chrom']]
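# Keep autosomes in the analyzed range; copy so the later column assignments act on an independent frame
cpgislands = cpgislands.loc[cpgislands['chrom'].isin([str(chrom) for chrom in range(chr_range,23)])].copy()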
cpgislands['chrom'] = cpgislands['chrom'].astype(int)
cpgislands.columns = ['chrom', 'start', 'end']
# CpG island "shores" are within 2kb of CpG islands
cpgislands['start'] = cpgislands['start'] - 2000
cpgislands['end'] = cpgislands['end'] + 2000

In [None]:
# Measure distance between CG motifs and CpG islands
STRs_CG = measure_distance(long_STRs['GC'].copy(), cpgislands, 'CGI')
STRs_CG = STRs_CG.loc[STRs_CG['CGI_distance_min'] > 0]
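
`measure_distance` is defined earlier in the notebook and is not reproduced here. As a rough illustration of the idea only, the sketch below computes the gap between each motif and the nearest padded CpG island on the same chromosome, with 0 meaning overlap. The function name `min_interval_distance`, the coordinate convention, and the toy coordinates are all assumptions; only the `CGI_distance_min` column name follows the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: minimum gap to any interval on the same chromosome (0 if overlapping)
def min_interval_distance(motifs, intervals, name):
    out = motifs.copy()
    out[name + '_distance_min'] = np.inf
    for chrom, group in motifs.groupby('chrom'):
        targets = intervals.loc[intervals['chrom'] == chrom]
        if targets.empty:
            continue  # no intervals on this chromosome: distance stays infinite
        # Pairwise gaps, motifs in rows and intervals in columns
        gap_right = targets['start'].to_numpy()[None, :] - group['end'].to_numpy()[:, None]
        gap_left = group['start'].to_numpy()[:, None] - targets['end'].to_numpy()[None, :]
        gaps = np.maximum(np.maximum(gap_right, gap_left), 0)
        out.loc[group.index, name + '_distance_min'] = gaps.min(axis=1)
    return out

# One island at 1000-2000, padded by 2 kb on each side to include its shores
toy_islands = pd.DataFrame({'chrom': [1], 'start': [1000 - 2000], 'end': [2000 + 2000]})
toy_motifs = pd.DataFrame({'chrom': [1, 1, 2], 'start': [3500, 5000, 100], 'end': [3520, 5020, 120]})

toy_distance = min_interval_distance(toy_motifs, toy_islands, 'CGI')
toy_outside = toy_distance.loc[toy_distance['CGI_distance_min'] > 0]
print(toy_distance)
```

In this toy case the first motif lies within the shore and is dropped by the `> 0` filter, while the other two are retained as lying outside CpG islands and shores.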

### Count mutations flanking repeats

#### STRs <a name="mutation_surrounding_analysis_str"></a>

[Return to Table of Contents](#TOC)

In [None]:
count_long_STRs = dict()

for repeat in repeats_highpower_asym.index:
    print('\r' + '                     ' + repeat, end='  ')
    count_long_STRs[repeat] = count_mut_flank(long_STRs[repeat], useful_cols = ['Strand'], gc_nmer = 51)
for repeat in repeats_highpower_sym:
    print('\r' + '                     ' + repeat, end='  ')
    count_long_STRs[repeat] = count_mut_flank(long_STRs[repeat], gc_nmer = 51)

count_long_STRs['GC_noCGI'] = count_mut_flank(STRs_CG, filter_cols = ['RM_distance', 'STR_distance', 'nonSTR_distance', 'within_motif_distance', 'CGI_distance'], gc_nmer = 51)
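
`count_mut_flank` is defined earlier in the notebook; the sketch below is not that function. It only illustrates the underlying idea of tallying mutations by position relative to motif boundaries. The helper name, the single-chromosome setup, the inclusive coordinates, and the absence of strand, quality, and GC handling are all simplifying assumptions.

```python
import pandas as pd

# Hypothetical, simplified counter: offsets -flank..-1 lie upstream of 'start',
# offsets +1..+flank lie downstream of 'end' (both treated as inclusive motif coordinates)
def count_flanking_mutations(motifs, mutation_positions, flank=2):
    offsets = list(range(-flank, 0)) + list(range(1, flank + 1))
    counts = pd.Series(0, index=offsets)
    coverage = pd.Series(0, index=offsets)
    for start, end in zip(motifs['start'], motifs['end']):
        for offset in offsets:
            position = start + offset if offset < 0 else end + offset
            coverage.loc[offset] += 1            # one more motif contributes this offset
            if position in mutation_positions:
                counts.loc[offset] += 1          # a mutation is observed at this offset
    return counts, coverage

toy_motifs = pd.DataFrame({'start': [100, 200], 'end': [110, 210]})
toy_mutations = {99, 111, 198, 212}

toy_counts, toy_coverage = count_flanking_mutations(toy_motifs, toy_mutations)
print(toy_counts / toy_coverage)
```

Keeping coverage alongside the counts matters because flanks can be filtered position by position, so the denominator is not necessarily the same at every offset.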

In [None]:
norm_long_STRs = dict()
for repeat in count_long_STRs:
    print('\r' + '                               ' + repeat, end='  ')
    norm_long_STRs[repeat] = mut_norm_conf(count_long_STRs[repeat], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True, gc_correct = True)

norm_long_STRs['GC_noCGI'] = mut_norm_conf(count_long_STRs['GC_noCGI'], min_count = 10, normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True, gc_correct = True)

#### Inverted repeats <a name="mutation_surrounding_analysis_ir"></a>

[Return to Table of Contents](#TOC)

In [None]:
count_IRs = dict()
for motif, name in zip([long_IRs, very_long_IRs, AT80_IRs, AT40_IRs, long_IRs_loop10], ['long_IRs', 'very_long_IRs', 'AT80_IRs', 'AT40_IRs', 'long_IRs_loop10']):
    print('\r' + '                     ' + name, end='  ')
    count_IRs[name] = count_mut_flank(motif, gc_nmer = 51)

In [None]:
norm_IRs = dict()
for name in count_IRs:
    print('\r' + '                               ' + name, end='  ')
    norm_IRs[name] = mut_norm_conf(count_IRs[name], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True, gc_correct = True)

#### Mirror repeats <a name="mutation_surrounding_analysis_mr"></a>

[Return to Table of Contents](#TOC)

In [None]:
count_MRs = dict()
for motif, name in zip([long_MRs, very_long_MRs, non_homopurine_MRs, long_MRs_loop10], ['long_MRs', 'very_long_MRs', 'non_homopurine_MRs', 'long_MRs_loop10']):
    print('\r' + '                     ' + name, end='  ')
    count_MRs[name] = count_mut_flank(motif, gc_nmer = 51)
for motif, name in zip([homopurine_MRs], ['homopurine_MRs']):
    print('\r' + '                     ' + name, end='  ')
    count_MRs[name] = count_mut_flank(motif, useful_cols = ['Strand'], gc_nmer = 51)

In [None]:
norm_MRs = dict()
for name in count_MRs:
    print('\r' + '                               ' + name, end='  ')
    norm_MRs[name] = mut_norm_conf(count_MRs[name], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True, gc_correct = True)

#### Direct repeats <a name="mutation_surrounding_analysis_dr"></a>

[Return to Table of Contents](#TOC)

In [None]:
count_DRs = dict()
for motif, name in zip([long_DRs, very_long_DRs, perfect_long_DRs, long_DRs_loop10], ['long_DRs', 'very_long_DRs', 'perfect_long_DRs', 'long_DRs_loop10']):
    print('\r' + '                     ' + name, end='  ')
    count_DRs[name] = count_mut_flank(motif, gc_nmer = 51)

In [None]:
norm_DRs = dict()
for name in count_DRs:
    print('\r' + '                               ' + name, end='  ')
    norm_DRs[name] = mut_norm_conf(count_DRs[name], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True, gc_correct = True)

#### Z-DNA motifs <a name="mutation_surrounding_analysis_zdna"></a>

[Return to Table of Contents](#TOC)

In [None]:
count_ZDNAs = dict()
for motif, name in zip([all_ZDNAs, long_ZDNAs], ['all_ZDNAs', 'long_ZDNAs']):
    print('\r' + '                     ' + name, end='  ')
    count_ZDNAs[name] = count_mut_flank(motif, gc_nmer = 51)
for motif, name in zip([ZDNAs_GY], ['ZDNAs_GY']):
    print('\r' + '                     ' + name, end='  ')
    count_ZDNAs[name] = count_mut_flank(motif, useful_cols = ['Strand'], strand_names = ('G', 'C'), gc_nmer = 51)

In [None]:
norm_ZDNAs = dict()
for name in count_ZDNAs:
    print('\r' + '                               ' + name, end='  ')
    norm_ZDNAs[name] = mut_norm_conf(count_ZDNAs[name], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True, gc_correct = True)

#### G4 motifs <a name="mutation_surrounding_analysis_g4"></a>

[Return to Table of Contents](#TOC)

In [None]:
count_G4s = dict()
for motif, name in zip([K_G4s, PDS_G4s], ['K_G4s', 'PDS_G4s']):
    print('\r' + '                     ' + name, end='  ')
    count_G4s[name] = count_mut_flank(motif, useful_cols = ['Strand'], gc_nmer = 51)

In [None]:
norm_G4s = dict()
for name in count_G4s:
    print('\r' + '                               ' + name, end='  ')
    norm_G4s[name] = mut_norm_conf(count_G4s[name], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True, gc_correct = True)

### Save / load mutation counts  <a name="mutation_surrounding_analysis_saveload"></a>

[Return to Table of Contents](#TOC)

#### Save/load mutation counts

In [None]:
# Save temporary output of the mutation counts
with open('./analysis/temp/long_STRs_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(count_long_STRs, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/IRs_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(count_IRs, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/DRs_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(count_DRs, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/MRs_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(count_MRs, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/ZDNAs_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(count_ZDNAs, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/G4s_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(count_G4s, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/random_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(count_random_all, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Load temporary output of the mutation counts
with open('./analysis/temp/long_STRs_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    count_long_STRs = pickle.load(handle)
with open('./analysis/temp/IRs_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    count_IRs = pickle.load(handle)
with open('./analysis/temp/DRs_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    count_DRs = pickle.load(handle)
with open('./analysis/temp/MRs_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    count_MRs = pickle.load(handle)
with open('./analysis/temp/ZDNAs_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    count_ZDNAs = pickle.load(handle)
with open('./analysis/temp/G4s_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    count_G4s = pickle.load(handle)
with open('./analysis/temp/random_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    count_random_all = pickle.load(handle)
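
The repeated save/load blocks above follow one pattern per object. A compact sketch of the same round trip, assuming hypothetical helper names (`save_counts`, `load_counts`), a toy payload, and a temporary directory in place of `./analysis/temp/`:

```python
import os
import pickle
import tempfile

# Hypothetical helper: write each object to <directory>/<name><suffix>
def save_counts(objects, directory, suffix):
    for name, obj in objects.items():
        with open(os.path.join(directory, name + suffix), 'wb') as handle:
            pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Hypothetical helper: read the same files back into a dict keyed by name
def load_counts(names, directory, suffix):
    loaded = dict()
    for name in names:
        with open(os.path.join(directory, name + suffix), 'rb') as handle:
            loaded[name] = pickle.load(handle)
    return loaded

toy_counts = {'IRs': {'long_IRs': [1, 2, 3]}, 'G4s': {'K_G4s': [4, 5]}}
toy_suffix = '_flank_ACcorrect_mutcount_chr21-22.pickle'

with tempfile.TemporaryDirectory() as tmp_dir:
    save_counts(toy_counts, tmp_dir, toy_suffix)
    restored = load_counts(list(toy_counts), tmp_dir, toy_suffix)
print(restored == toy_counts)
```

Pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.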

#### Save/load normalized mutation rates

In [None]:
# Save temporary output of the normalized mutation rates

flanking_norm_all = dict()
flanking_norm_all['STR'] = norm_long_STRs.copy()
flanking_norm_all['IR'] = norm_IRs.copy()
flanking_norm_all['DR'] = norm_DRs.copy()
flanking_norm_all['MR'] = norm_MRs.copy()
flanking_norm_all['ZDNA'] = norm_ZDNAs.copy()
flanking_norm_all['G4'] = norm_G4s.copy()
flanking_norm_all['random'] = {'random': norm_random_all_normtorandom}.copy()
flanking_norm_all['random_nonorm'] = {'random_nonorm': norm_random_all}.copy()

with open('./analysis/temp/flank_norm_all_ACcorrect_GCcorrected_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(flanking_norm_all, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Load temporary output of the normalized mutation rates
with open('./analysis/temp/flank_norm_all_ACcorrect_GCcorrected_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    flanking_norm_all = pickle.load(handle)

# Median SNV frequencies for each QC filter, used to normalize all later mutation frequencies
normtorandom_all = pd.Series([flanking_norm_all['random_nonorm']['random_nonorm'][0][qc_cutoff].loc[-50:50].median() for qc_cutoff in flanking_norm_all['random']['random'][0]], index = list(flanking_norm_all['random']['random'][0]))
norm_random_simple_freq = count_random_all[0][-np.inf].sum() / count_random_all[1].sum()
normtorandom_simplefreq = norm_random_simple_freq.loc[-50:50].median()
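
The baseline computed above is the median per-position mutation frequency around random sequences within 50 bp. A toy sketch of that calculation and of dividing a motif's frequency by it follows; the tuple layout (element 0 maps a QC cutoff to per-motif mutation counts, element 1 holds per-motif coverage, columns are positions) is inferred from the indexing in the cell above and should be treated as an assumption, as should all numbers.

```python
import numpy as np
import pandas as pd

positions = list(range(-60, 61))

# Toy stand-in for count_random_all (assumed layout, invented values)
random_mut = pd.DataFrame(1, index=range(2), columns=positions)
random_cov = pd.DataFrame(10, index=range(2), columns=positions)
count_random_toy = ({-np.inf: random_mut}, random_cov)

# Per-position frequency, then the median over positions -50..50 as the baseline
random_freq = count_random_toy[0][-np.inf].sum() / count_random_toy[1].sum()
baseline = random_freq.loc[-50:50].median()

# Toy motif with a three-fold excess of mutations at position 0
motif_mut = pd.DataFrame(1, index=range(2), columns=positions)
motif_mut[0] = 3
motif_cov = pd.DataFrame(10, index=range(2), columns=positions)
relative_rate = (motif_mut.sum() / motif_cov.sum()) / baseline
print(baseline, relative_rate.loc[0], relative_rate.loc[-10])
```

Dividing by a median rather than a mean makes the baseline less sensitive to a few positions with unusually high or low frequency.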

## Count de novo mutations surrounding motifs  <a name="mutation_surrounding_denovo"></a>

[Return to Table of Contents](#TOC)

#### Random sequences

In [None]:
count_random_denovo = count_mut_flank(random_seq_larger, useful_cols = ['Strand'], input_mut_dict = denovo_slim_reformat, ignore_qual = 'denovo')

norm_random_denovo = mut_norm_conf(count_random_denovo, genome_AC_freq_current = denovo_freq_all, n_genomes = denovo_n_genomes)

In [None]:
# Median SNV frequencies for each QC filter, used to normalize all later mutation frequencies
normtorandom_denovo = pd.Series([norm_random_denovo[0][qc_cutoff].loc[-50:50].median() for qc_cutoff in norm_random_denovo[0]], index = list(norm_random_denovo[0]))
normtorandom_denovo

norm_random_simple_freq = count_random_denovo[0]['denovo'].sum() / count_random_denovo[1].sum()
normtorandom_simplefreq = norm_random_simple_freq.loc[-50:50].median()
normtorandom_simplefreq

norm_random_denovo_normtorandom = mut_norm_conf(count_random_denovo, normtorandom = True, n_genomes = denovo_n_genomes, random_normaverage = normtorandom_denovo, genome_AC_freq_current = denovo_freq_all, rolling = 10)

#### STRs

In [None]:
count_long_STRs_denovo = dict()

for repeat in ['A', 'C', 'AG', 'AC']:
    print('\r' + '                     ' + repeat, end='  ')
    count_long_STRs_denovo[repeat] = count_mut_flank(long_STRs[repeat], useful_cols = ['Strand'], input_mut_dict = denovo_slim_reformat, ignore_qual = 'denovo')

for repeat in ['AT']:
    print('\r' + '                     ' + repeat, end='  ')
    count_long_STRs_denovo[repeat] = count_mut_flank(long_STRs[repeat], input_mut_dict = denovo_slim_reformat, ignore_qual = 'denovo')

In [None]:
norm_long_STRs_denovo = dict()
for repeat in count_long_STRs_denovo:
    print('\r' + '                               ' + repeat, end='  ')
    norm_long_STRs_denovo[repeat] = mut_norm_conf(count_long_STRs_denovo[repeat], normtorandom = True, random_normaverage = normtorandom_denovo, do_binconf = True, genome_AC_freq_current = denovo_freq_all, n_genomes = denovo_n_genomes, rolling=True)

#### G4 motifs

In [None]:
count_G4s_denovo = dict()
count_G4s_denovo['K_G4s'] = count_mut_flank(K_G4s, useful_cols = ['Strand'], input_mut_dict = denovo_slim_reformat, ignore_qual = 'denovo')

In [None]:
norm_G4s_denovo = dict()
for name in count_G4s_denovo:
    print('\r' + '                               ' + name, end='  ')
    norm_G4s_denovo[name] = mut_norm_conf(count_G4s_denovo[name], normtorandom = True,  random_normaverage = normtorandom_denovo, do_binconf = True, genome_AC_freq_current = denovo_freq_all, n_genomes = denovo_n_genomes, rolling = True)

#### Save/load normalized mutation rates

In [None]:
# Save temporary output of the de novo mutation counts and normalized rates
with open('./analysis/temp/flank_count_random_denovo_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(count_random_denovo, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/flank_count_long_STRs_denovo_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(count_long_STRs_denovo, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/flank_count_G4s_denovo_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(count_G4s_denovo, handle, protocol=pickle.HIGHEST_PROTOCOL)

flanking_norm_denovo = dict()
flanking_norm_denovo['STR'] = norm_long_STRs_denovo.copy()
flanking_norm_denovo['G4'] = norm_G4s_denovo.copy()
flanking_norm_denovo['random'] = {'random': norm_random_denovo_normtorandom}.copy()
flanking_norm_denovo['random_nonorm'] = {'random_nonorm': norm_random_denovo}.copy()

with open('./analysis/temp/flank_norm_denovo_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(flanking_norm_denovo, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Load temporary output of the de novo mutation counts and normalized rates
with open('./analysis/temp/flank_count_random_denovo_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    count_random_denovo = pickle.load(handle)
with open('./analysis/temp/flank_count_long_STRs_denovo_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    count_long_STRs_denovo = pickle.load(handle)
with open('./analysis/temp/flank_count_G4s_denovo_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    count_G4s_denovo = pickle.load(handle)
with open('./analysis/temp/flank_norm_denovo_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    flanking_norm_denovo = pickle.load(handle)

# Median SNV frequencies for each QC filter, used to normalize all later mutation frequencies
normtorandom_denovo = pd.Series([flanking_norm_denovo['random_nonorm']['random_nonorm'][0]['denovo'].loc[-50:50].median() for qc_cutoff in flanking_norm_denovo['random']['random'][0]], index = list(flanking_norm_denovo['random']['random'][0]))
norm_random_simple_freq = count_random_denovo[0]['denovo'].sum() / count_random_denovo[1].sum()
normtorandom_simplefreq = norm_random_simple_freq.loc[-50:50].median()

## Perform improper analysis of Non-B DB (on purpose) <a name="mutation_surrounding_nonbdb"></a>

[Return to Table of Contents](#TOC)

#### Load Non-B DB

In [None]:
# Load modified Non-B DB
nonbdb = pd.read_csv('./nonbdb/nonbdb_modified_chr'+str(chr_range)+'-22.csv.gz', compression = 'gzip', low_memory = False)

# Replace NaN with the text string 'NA' for counting purposes
for col in ['Type', 'length', 'repeat', 'Spacer', 'Tracts', 'Subset', 'Composition', 'Strand']:
    nonbdb[col] = nonbdb[col].replace(np.nan, 'NA')
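
One plausible reason for this replacement (an assumption, since the counting code is not shown here) is that pandas drops NaN group keys by default, so rows with missing annotations would vanish from grouped counts. A toy illustration with invented rows:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Type': ['Inverted_Repeat', 'Z_DNA_Motif', 'Inverted_Repeat'],
    'Strand': ['+', np.nan, np.nan],
})

# Default groupby behavior: groups with a NaN key are silently dropped
counts_with_nan = toy.groupby(['Type', 'Strand']).size()

# After replacing NaN with the string 'NA', every row is counted
toy['Strand'] = toy['Strand'].replace(np.nan, 'NA')
counts_with_na = toy.groupby(['Type', 'Strand']).size()
print(counts_with_nan.sum(), counts_with_na.sum())
```

A side effect worth keeping in mind is that numeric columns treated this way become object dtype wherever an 'NA' string is present.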

#### Separate categories

In [None]:
# STR motifs: mono- and dinucleotide
nonbdb_all_STRs = nonbdb.loc[(nonbdb['Type'] == 'Short_Tandem_Repeat')].copy()
nonbdb_STRs = dict()

nonbdb_STRs['A'] = nonbdb_all_STRs.loc[nonbdb_all_STRs['Composition'] == '1A/0C/0G/0T'].copy()
nonbdb_STRs['A']['Strand'] = '+'
nonbdb_STRs['T'] = nonbdb_all_STRs.loc[nonbdb_all_STRs['Composition'] == '0A/0C/0G/1T'].copy()
nonbdb_STRs['T']['Strand'] = '-'
nonbdb_STRs['A'] = pd.concat([nonbdb_STRs['A'], nonbdb_STRs['T']])

nonbdb_STRs['C'] = nonbdb_all_STRs.loc[nonbdb_all_STRs['Composition'] == '0A/1C/0G/0T'].copy()
nonbdb_STRs['C']['Strand'] = '+'
nonbdb_STRs['G'] = nonbdb_all_STRs.loc[nonbdb_all_STRs['Composition'] == '0A/0C/1G/0T'].copy()
nonbdb_STRs['G']['Strand'] = '-'
nonbdb_STRs['C'] = pd.concat([nonbdb_STRs['C'], nonbdb_STRs['G']])

nonbdb_STRs['AG'] = nonbdb_all_STRs.loc[nonbdb_all_STRs['Composition'] == '1A/0C/1G/0T'].copy()
nonbdb_STRs['AG']['Strand'] = '+'
nonbdb_STRs['TC'] = nonbdb_all_STRs.loc[nonbdb_all_STRs['Composition'] == '0A/1C/0G/1T'].copy()
nonbdb_STRs['TC']['Strand'] = '-'
nonbdb_STRs['AG'] = pd.concat([nonbdb_STRs['AG'], nonbdb_STRs['TC']])

nonbdb_STRs['AC'] = nonbdb_all_STRs.loc[nonbdb_all_STRs['Composition'] == '1A/1C/0G/0T'].copy()
nonbdb_STRs['AC']['Strand'] = '+'
nonbdb_STRs['TG'] = nonbdb_all_STRs.loc[nonbdb_all_STRs['Composition'] == '0A/0C/1G/1T'].copy()
nonbdb_STRs['TG']['Strand'] = '-'
nonbdb_STRs['AC'] = pd.concat([nonbdb_STRs['AC'], nonbdb_STRs['TG']])

nonbdb_STRs['AT'] = nonbdb_all_STRs.loc[nonbdb_all_STRs['Composition'] == '1A/0C/0G/1T'].copy()
nonbdb_STRs['GC'] = nonbdb_all_STRs.loc[nonbdb_all_STRs['Composition'] == '0A/1C/1G/0T'].copy()
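
Each asymmetric motif above is built the same way: select the forward composition, label it `+`, select the complementary composition, label it `-`, and concatenate. A sketch of that pattern as a single function, where `pair_by_composition` is a hypothetical helper name and the table is toy data:

```python
import pandas as pd

# Hypothetical helper: merge a motif with its complement under one name, tracking strand
def pair_by_composition(strs, plus_composition, minus_composition):
    plus = strs.loc[strs['Composition'] == plus_composition].copy()
    plus['Strand'] = '+'
    minus = strs.loc[strs['Composition'] == minus_composition].copy()
    minus['Strand'] = '-'
    return pd.concat([plus, minus])

toy_STRs = pd.DataFrame({
    'Composition': ['1A/0C/0G/0T', '0A/0C/0G/1T', '0A/1C/0G/0T'],
    'start': [10, 50, 90],
})

toy_polyA = pair_by_composition(toy_STRs, '1A/0C/0G/0T', '0A/0C/0G/1T')
print(toy_polyA)
```

Symmetric motifs such as AT and GC are their own reverse complements, which is why the cell above leaves them without a strand label.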

In [None]:
# Inverted repeats
nonbdb_IRs = dict()
nonbdb_IRs['all'] = nonbdb.loc[(nonbdb['Type'] == 'Inverted_Repeat')].dropna(axis = 1, how = 'all').copy()
nonbdb_IRs['long'] = nonbdb_IRs['all'].loc[nonbdb_IRs['all']['repeat'] > nonbdb_IRs['all']['repeat'].quantile(0.8)].copy()
nonbdb_IRs['very_long'] = nonbdb_IRs['all'].loc[nonbdb_IRs['all']['repeat'] > nonbdb_IRs['all']['repeat'].quantile(0.95)].copy()
# IRs with shorter loop length
nonbdb_IRs['long_loop10'] = nonbdb_IRs['long'].loc[nonbdb_IRs['long']['Spacer'] < 11].copy()
del nonbdb_IRs['all']

In [None]:
# Mirror repeats
nonbdb_MRs = dict()
nonbdb_MRs['all'] = nonbdb.loc[(nonbdb['Type'] == 'Mirror_Repeat')].dropna(axis = 1, how = 'all').copy()
nonbdb_MRs['long'] = nonbdb_MRs['all'].loc[nonbdb_MRs['all']['repeat'] > nonbdb_MRs['all']['repeat'].quantile(0.8)].copy()
nonbdb_MRs['very_long'] = nonbdb_MRs['all'].loc[nonbdb_MRs['all']['repeat'] > nonbdb_MRs['all']['repeat'].quantile(0.95)].copy()
# MRs with shorter loop length
nonbdb_MRs['long_loop10'] = nonbdb_MRs['long'].loc[nonbdb_MRs['long']['Spacer'] < 11].copy()

nonbdb_MRs['long']['%purine'] = (nonbdb_MRs['long']['Sequence'].str.count('a') + nonbdb_MRs['long']['Sequence'].str.count('g')) / nonbdb_MRs['long']['length']

nonbdb_MRs['long_homopurine_AG'] = nonbdb_MRs['long'].loc[nonbdb_MRs['long']['%purine'] == 1].copy()
nonbdb_MRs['long_homopurine_AG']['Strand'] = '+'
nonbdb_MRs['long_homopurine_TC'] = nonbdb_MRs['long'].loc[nonbdb_MRs['long']['%purine'] == 0].copy()
nonbdb_MRs['long_homopurine_TC']['Strand'] = '-'
nonbdb_MRs['long_homopurine'] = pd.concat([nonbdb_MRs['long_homopurine_AG'], nonbdb_MRs['long_homopurine_TC']])

nonbdb_MRs['long_almost_homopurine_AG'] = nonbdb_MRs['long'].loc[nonbdb_MRs['long']['%purine'] > .90].copy()
nonbdb_MRs['long_almost_homopurine_AG']['Strand'] = '+'
nonbdb_MRs['long_almost_homopurine_TC'] = nonbdb_MRs['long'].loc[nonbdb_MRs['long']['%purine'] < 0.10].copy()
nonbdb_MRs['long_almost_homopurine_TC']['Strand'] = '-'
nonbdb_MRs['long_almost_homopurine'] = pd.concat([nonbdb_MRs['long_almost_homopurine_AG'], nonbdb_MRs['long_almost_homopurine_TC']])

nonbdb_MRs['long_non_homopurine'] = nonbdb_MRs['long'].loc[(nonbdb_MRs['long']['%purine'] < .80) & (nonbdb_MRs['long']['%purine'] > .20)].copy()

nonbdb_MRs['long_homopurine_loop10'] = nonbdb_MRs['long_homopurine'].loc[nonbdb_MRs['long_homopurine']['Spacer'] < 11]

del nonbdb_MRs['all']
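
The purine-fraction categories above overlap and also leave gaps, which is easy to see on toy sequences. The sketch below assumes lower-case sequences as in the cell above, invented examples, and string length as the denominator instead of the `length` column.

```python
import pandas as pd

toy_MRs = pd.DataFrame({'Sequence': ['agagagga', 'tctcttcc', 'agctagct', 'aggagagagt']})

# Fraction of purines (a or g) in each sequence
toy_MRs['%purine'] = (toy_MRs['Sequence'].str.count('a') + toy_MRs['Sequence'].str.count('g')) / toy_MRs['Sequence'].str.len()

# Same thresholds as the cell above
toy_MRs['homopurine'] = (toy_MRs['%purine'] == 1) | (toy_MRs['%purine'] == 0)
toy_MRs['almost_homopurine'] = (toy_MRs['%purine'] > 0.90) | (toy_MRs['%purine'] < 0.10)
toy_MRs['non_homopurine'] = (toy_MRs['%purine'] < 0.80) & (toy_MRs['%purine'] > 0.20)
print(toy_MRs)
```

Strict homopurine tracts are also counted as "almost homopurine", while a sequence at exactly 90% purine (the last row) falls into none of the three categories.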

In [None]:
# Direct repeats
nonbdb_DRs = dict()
nonbdb_DRs['all'] = nonbdb.loc[(nonbdb['Type'] == 'Direct_Repeat')].dropna(axis = 1, how = 'all').copy()
nonbdb_DRs['long'] = nonbdb_DRs['all'].loc[nonbdb_DRs['all']['repeat'] > nonbdb_DRs['all']['repeat'].quantile(0.8)].copy()
nonbdb_DRs['very_long'] = nonbdb_DRs['all'].loc[nonbdb_DRs['all']['repeat'] > nonbdb_DRs['all']['repeat'].quantile(0.95)].copy()
# DRs with shorter loop length
nonbdb_DRs['long_loop10'] = nonbdb_DRs['long'].loc[nonbdb_DRs['long']['Spacer'] < 11].copy()

del nonbdb_DRs['all']

In [None]:
# Z-DNA motifs
nonbdb_ZDNAs = dict()
nonbdb_ZDNAs['all'] = nonbdb.loc[(nonbdb['Type'] == 'Z_DNA_Motif')].dropna(axis = 1, how = 'all').copy()
nonbdb_ZDNAs['long'] = nonbdb_ZDNAs['all'].loc[nonbdb_ZDNAs['all']['length'] > nonbdb_ZDNAs['all']['length'].quantile(0.8)].copy()

In [None]:
# G4 motifs
nonbdb_G4s = dict()
nonbdb_G4s['all'] = nonbdb.loc[nonbdb['Type'] == 'G4'].dropna(axis = 1, how = 'all').copy()
nonbdb_G4s['highscore'] = nonbdb_G4s['all'].loc[nonbdb_G4s['all']['Score'] > nonbdb_G4s['all']['Score'].quantile(0.8)].copy()

### Count and analyze Non-B DB flanking mutations for each motif category <a name="mutation_surrounding_nonbdb_count"></a>

#### STRs <a name="mutation_surrounding_nonbdb_str"></a>

[Return to Table of Contents](#TOC)

In [None]:
nonbdb_count_long_STRs = dict()

for repeat in ['A', 'C', 'AC', 'AG']:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_long_STRs[repeat] = count_mut_flank(nonbdb_STRs[repeat].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = [], strand_col = 'Strand', useful_cols = ['Strand'])
for repeat in ['AT', 'GC']:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_long_STRs[repeat] = count_mut_flank(nonbdb_STRs[repeat].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = [], useful_cols = [])

In [None]:
nonbdb_count_long_STRs_noRM = dict()

for repeat in ['A', 'C', 'AC', 'AG']:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_long_STRs_noRM[repeat] = count_mut_flank(nonbdb_STRs[repeat].loc[nonbdb_STRs[repeat]['RM_distance_min'] > 0].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = ['RM_distance'], strand_col = 'Strand', useful_cols = ['Strand'])
for repeat in ['AT', 'GC']:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_long_STRs_noRM[repeat] = count_mut_flank(nonbdb_STRs[repeat].loc[nonbdb_STRs[repeat]['RM_distance_min'] > 0].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = ['RM_distance'], useful_cols = [])

In [None]:
nonbdb_count_long_STRs_overlaps = dict()

for repeat in ['A', 'C', 'AC', 'AG']:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_long_STRs_overlaps[repeat] = count_mut_flank(nonbdb_STRs[repeat].loc[(nonbdb_STRs[repeat]['RM_distance_min'] > 0) & (nonbdb_STRs[repeat]['nonbdb_distance_min'] > 0)].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = ['RM_distance', 'nonbdb_distance'], strand_col = 'Strand', useful_cols = ['Strand'])
for repeat in ['AT', 'GC']:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_long_STRs_overlaps[repeat] = count_mut_flank(nonbdb_STRs[repeat].loc[(nonbdb_STRs[repeat]['RM_distance_min'] > 0) & (nonbdb_STRs[repeat]['nonbdb_distance_min'] > 0)].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = ['RM_distance', 'nonbdb_distance'], useful_cols = [])

In [None]:
# Save temporary output of the mutation counts
# ACcorrect
with open('./analysis/temp/nonbdb_STRs_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_long_STRs, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/nonbdb_STRs_flank_ACcorrect_mutcount_noRM_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_long_STRs_noRM, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/nonbdb_STRs_flank_ACcorrect_mutcount_overlaps_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_long_STRs_overlaps, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Load temporary output of the mutation counts
with open('./analysis/temp/nonbdb_STRs_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_long_STRs = pickle.load(handle)
with open('./analysis/temp/nonbdb_STRs_flank_ACcorrect_mutcount_noRM_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_long_STRs_noRM = pickle.load(handle)
with open('./analysis/temp/nonbdb_STRs_flank_ACcorrect_mutcount_overlaps_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_long_STRs_overlaps = pickle.load(handle)

In [None]:
nonbdb_simple_freq_long_STRs = dict()
for repeat in nonbdb_count_long_STRs:
    nonbdb_simple_freq_long_STRs[repeat] = nonbdb_count_long_STRs[repeat][0][-np.inf].sum() / nonbdb_count_long_STRs[repeat][1].sum()

nonbdb_simple_freq_long_STRs_noRM = dict()
for repeat in nonbdb_count_long_STRs_noRM:
    nonbdb_simple_freq_long_STRs_noRM[repeat] = nonbdb_count_long_STRs_noRM[repeat][0][-np.inf].sum() / nonbdb_count_long_STRs_noRM[repeat][1].sum()

nonbdb_norm_long_STRs_noRM = dict()
for repeat in nonbdb_count_long_STRs_noRM:
    print('\r' + '                               ' + repeat, end='  ')
    nonbdb_norm_long_STRs_noRM[repeat] = mut_norm_conf(nonbdb_count_long_STRs_noRM[repeat], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True)

nonbdb_norm_long_STRs_overlaps = dict()
for repeat in nonbdb_count_long_STRs_overlaps:
    print('\r' + '                               ' + repeat, end='  ')
    nonbdb_norm_long_STRs_overlaps[repeat] = mut_norm_conf(nonbdb_count_long_STRs_overlaps[repeat], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True)

#### IRs <a name="mutation_surrounding_nonbdb_ir"></a>

[Return to Table of Contents](#TOC)

In [None]:
nonbdb_count_IRs = dict()
for repeat in nonbdb_IRs:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_IRs[repeat] = count_mut_flank(nonbdb_IRs[repeat].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = [], useful_cols = [])

In [None]:
nonbdb_count_IRs_noRM = dict()
for repeat in nonbdb_IRs:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_IRs_noRM[repeat] = count_mut_flank(nonbdb_IRs[repeat].loc[nonbdb_IRs[repeat]['RM_distance_min'] > 0].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = ['RM_distance'], useful_cols = [])

In [None]:
nonbdb_count_IRs_overlaps = dict()
for repeat in nonbdb_IRs:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_IRs_overlaps[repeat] = count_mut_flank(nonbdb_IRs[repeat].loc[(nonbdb_IRs[repeat]['RM_distance_min'] > 0) & (nonbdb_IRs[repeat]['nonbdb_distance_min'] > 0)].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = ['RM_distance', 'nonbdb_distance'], useful_cols = [])

In [None]:
# Save temporary output of the mutation counts
with open('./analysis/temp/nonbdb_IRs_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_IRs, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/nonbdb_IRs_flank_ACcorrect_mutcount_noRM_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_IRs_noRM, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/nonbdb_IRs_flank_ACcorrect_mutcount_overlaps_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_IRs_overlaps, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Load temporary output of the mutation counts
with open('./analysis/temp/nonbdb_IRs_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_IRs = pickle.load(handle)
with open('./analysis/temp/nonbdb_IRs_flank_ACcorrect_mutcount_noRM_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_IRs_noRM = pickle.load(handle)
with open('./analysis/temp/nonbdb_IRs_flank_ACcorrect_mutcount_overlaps_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_IRs_overlaps = pickle.load(handle)

In [None]:
nonbdb_simple_freq_IRs = dict()
for repeat in nonbdb_count_IRs:
    nonbdb_simple_freq_IRs[repeat] = nonbdb_count_IRs[repeat][0][-np.inf].sum() / nonbdb_count_IRs[repeat][1].sum()

nonbdb_simple_freq_IRs_noRM = dict()
for repeat in nonbdb_count_IRs_noRM:
    nonbdb_simple_freq_IRs_noRM[repeat] = nonbdb_count_IRs_noRM[repeat][0][-np.inf].sum() / nonbdb_count_IRs_noRM[repeat][1].sum()

nonbdb_norm_IRs_noRM = dict()
for repeat in nonbdb_count_IRs_noRM:
    print('\r' + '                               ' + repeat, end='  ')
    nonbdb_norm_IRs_noRM[repeat] = mut_norm_conf(nonbdb_count_IRs_noRM[repeat], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True)

nonbdb_norm_IRs_overlaps = dict()
for repeat in nonbdb_count_IRs_overlaps:
    print('\r' + '                               ' + repeat, end='  ')
    nonbdb_norm_IRs_overlaps[repeat] = mut_norm_conf(nonbdb_count_IRs_overlaps[repeat], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True)

#### DRs  <a name="mutation_surrounding_nonbdb_dr"></a>

[Return to Table of Contents](#TOC)
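
Each NonB-DB motif class in this part of the notebook is counted under three nested filters: all motifs, motifs with `RM_distance_min > 0`, and motifs that additionally have `nonbdb_distance_min > 0`. The sketch below illustrates that tiering on toy records in plain Python. The records and the `tiered_subsets` helper are hypothetical, and reading a distance of 0 as "overlaps the feature" is an assumption rather than something stated here.

```python
# Toy motif records; the column names mirror the notebook, the values are made up.
motifs = [
    {"name": "m1", "RM_distance_min": 0,   "nonbdb_distance_min": 12},
    {"name": "m2", "RM_distance_min": 35,  "nonbdb_distance_min": 0},
    {"name": "m3", "RM_distance_min": 140, "nonbdb_distance_min": 60},
]

def tiered_subsets(records):
    """Return the three nested subsets used for counting (hypothetical helper)."""
    no_rm = [r for r in records if r["RM_distance_min"] > 0]
    no_overlaps = [r for r in no_rm if r["nonbdb_distance_min"] > 0]
    return {"all": list(records), "noRM": no_rm, "overlaps": no_overlaps}

subsets = tiered_subsets(motifs)
tier_names = {tier: [r["name"] for r in rows] for tier, rows in subsets.items()}
print(tier_names)
```

Because the tiers are nested, every motif in the `overlaps` subset is also in `noRM`, so the three sets of counts can be compared directly.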

In [None]:
nonbdb_count_DRs = dict()
for repeat in nonbdb_DRs:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_DRs[repeat] = count_mut_flank(nonbdb_DRs[repeat].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = [], useful_cols = [])

nonbdb_count_DRs_noRM = dict()
for repeat in nonbdb_DRs:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_DRs_noRM[repeat] = count_mut_flank(nonbdb_DRs[repeat].loc[nonbdb_DRs[repeat]['RM_distance_min'] > 0].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = ['RM_distance'], useful_cols = [])

nonbdb_count_DRs_overlaps = dict()
for repeat in nonbdb_DRs:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_DRs_overlaps[repeat] = count_mut_flank(nonbdb_DRs[repeat].loc[(nonbdb_DRs[repeat]['RM_distance_min'] > 0) & (nonbdb_DRs[repeat]['nonbdb_distance_min'] > 0)].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = ['RM_distance', 'nonbdb_distance'], useful_cols = [])

In [None]:
# Save temporary output of the mutation counts
with open('./analysis/temp/nonbdb_DRs_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_DRs, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/nonbdb_DRs_flank_ACcorrect_mutcount_noRM_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_DRs_noRM, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/nonbdb_DRs_flank_ACcorrect_mutcount_overlaps_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_DRs_overlaps, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Load temporary output of the mutation counts
with open('./analysis/temp/nonbdb_DRs_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_DRs = pickle.load(handle)
with open('./analysis/temp/nonbdb_DRs_flank_ACcorrect_mutcount_noRM_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_DRs_noRM = pickle.load(handle)
with open('./analysis/temp/nonbdb_DRs_flank_ACcorrect_mutcount_overlaps_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_DRs_overlaps = pickle.load(handle)

In [None]:
nonbdb_simple_freq_DRs = dict()
for repeat in nonbdb_count_DRs:
    nonbdb_simple_freq_DRs[repeat] = nonbdb_count_DRs[repeat][0][-np.inf].sum() / nonbdb_count_DRs[repeat][1].sum()

nonbdb_simple_freq_DRs_noRM = dict()
for repeat in nonbdb_count_DRs_noRM:
    nonbdb_simple_freq_DRs_noRM[repeat] = nonbdb_count_DRs_noRM[repeat][0][-np.inf].sum() / nonbdb_count_DRs_noRM[repeat][1].sum()

nonbdb_norm_DRs_noRM = dict()
for repeat in nonbdb_count_DRs_noRM:
    print('\r' + '                               ' + repeat, end='  ')
    nonbdb_norm_DRs_noRM[repeat] = mut_norm_conf(nonbdb_count_DRs_noRM[repeat], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True)

nonbdb_norm_DRs_overlaps = dict()
for repeat in nonbdb_count_DRs_overlaps:
    print('\r' + '                               ' + repeat, end='  ')
    nonbdb_norm_DRs_overlaps[repeat] = mut_norm_conf(nonbdb_count_DRs_overlaps[repeat], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True)

#### MRs  <a name="mutation_surrounding_nonbdb_mr"></a>

[Return to Table of Contents](#TOC)

In [None]:
nonbdb_count_MRs = dict()
for repeat in ['long', 'long_loop10']:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_MRs[repeat] = count_mut_flank(nonbdb_MRs[repeat].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = [], strand_col = 'Strand', useful_cols = ['Strand'])
for repeat in ['long_homopurine', 'long_almost_homopurine', 'long_homopurine_loop10']:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_MRs[repeat] = count_mut_flank(nonbdb_MRs[repeat].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = [], useful_cols = [])

nonbdb_count_MRs_noRM = dict()
for repeat in ['long', 'long_loop10']:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_MRs_noRM[repeat] = count_mut_flank(nonbdb_MRs[repeat].loc[nonbdb_MRs[repeat]['RM_distance_min'] > 0].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = ['RM_distance'], strand_col = 'Strand', useful_cols = ['Strand'])
for repeat in ['long_homopurine', 'long_almost_homopurine', 'long_homopurine_loop10']:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_MRs_noRM[repeat] = count_mut_flank(nonbdb_MRs[repeat].loc[nonbdb_MRs[repeat]['RM_distance_min'] > 0].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = ['RM_distance'], useful_cols = [])

nonbdb_count_MRs_overlaps = dict()
# filter_cols includes 'nonbdb_distance' here, matching the overlap-filtered counts for the other NonB-DB motif classes
for repeat in ['long', 'long_loop10']:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_MRs_overlaps[repeat] = count_mut_flank(nonbdb_MRs[repeat].loc[(nonbdb_MRs[repeat]['RM_distance_min'] > 0) & (nonbdb_MRs[repeat]['nonbdb_distance_min'] > 0)].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = ['RM_distance', 'nonbdb_distance'], strand_col = 'Strand', useful_cols = ['Strand'])
for repeat in ['long_homopurine', 'long_almost_homopurine', 'long_homopurine_loop10']:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_MRs_overlaps[repeat] = count_mut_flank(nonbdb_MRs[repeat].loc[(nonbdb_MRs[repeat]['RM_distance_min'] > 0) & (nonbdb_MRs[repeat]['nonbdb_distance_min'] > 0)].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = ['RM_distance', 'nonbdb_distance'], useful_cols = [])

In [None]:
# Save temporary output of the mutation counts
with open('./analysis/temp/nonbdb_MRs_ACcorrect_flank_mutcount_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_MRs, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/nonbdb_MRs_ACcorrect_flank_mutcount_noRM_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_MRs_noRM, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/nonbdb_MRs_ACcorrect_flank_mutcount_overlaps_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_MRs_overlaps, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Load temporary output of the mutation counts
with open('./analysis/temp/nonbdb_MRs_ACcorrect_flank_mutcount_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_MRs = pickle.load(handle)
with open('./analysis/temp/nonbdb_MRs_ACcorrect_flank_mutcount_noRM_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_MRs_noRM = pickle.load(handle)
with open('./analysis/temp/nonbdb_MRs_ACcorrect_flank_mutcount_overlaps_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_MRs_overlaps = pickle.load(handle)

In [None]:
nonbdb_simple_freq_MRs = dict()
for repeat in nonbdb_count_MRs:
    nonbdb_simple_freq_MRs[repeat] = nonbdb_count_MRs[repeat][0][-np.inf].sum() / nonbdb_count_MRs[repeat][1].sum()

nonbdb_simple_freq_MRs_noRM = dict()
for repeat in nonbdb_count_MRs_noRM:
    nonbdb_simple_freq_MRs_noRM[repeat] = nonbdb_count_MRs_noRM[repeat][0][-np.inf].sum() / nonbdb_count_MRs_noRM[repeat][1].sum()

nonbdb_norm_MRs_noRM = dict()
for repeat in nonbdb_count_MRs_noRM:
    print('\r' + '                               ' + repeat, end='  ')
    nonbdb_norm_MRs_noRM[repeat] = mut_norm_conf(nonbdb_count_MRs_noRM[repeat], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True)

nonbdb_norm_MRs_overlaps = dict()
for repeat in nonbdb_count_MRs_overlaps:
    print('\r' + '                               ' + repeat, end='  ')
    nonbdb_norm_MRs_overlaps[repeat] = mut_norm_conf(nonbdb_count_MRs_overlaps[repeat], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True)

#### ZDNA  <a name="mutation_surrounding_nonbdb_zdna"></a>

[Return to Table of Contents](#TOC)

In [None]:
nonbdb_count_ZDNAs = dict()
for repeat in nonbdb_ZDNAs:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_ZDNAs[repeat] = count_mut_flank(nonbdb_ZDNAs[repeat].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = [], useful_cols = [])

nonbdb_count_ZDNAs_noRM = dict()
for repeat in nonbdb_ZDNAs:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_ZDNAs_noRM[repeat] = count_mut_flank(nonbdb_ZDNAs[repeat].loc[nonbdb_ZDNAs[repeat]['RM_distance_min'] > 0].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = ['RM_distance'], useful_cols = [])

nonbdb_count_ZDNAs_overlaps = dict()
for repeat in nonbdb_ZDNAs:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_ZDNAs_overlaps[repeat] = count_mut_flank(nonbdb_ZDNAs[repeat].loc[(nonbdb_ZDNAs[repeat]['RM_distance_min'] > 0) & (nonbdb_ZDNAs[repeat]['nonbdb_distance_min'] > 0)].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = ['RM_distance', 'nonbdb_distance'], useful_cols = [])

In [None]:
# Save temporary output of the mutation counts
with open('./analysis/temp/nonbdb_ZDNAs_ACcorrect_flank_mutcount_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_ZDNAs, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/nonbdb_ZDNAs_ACcorrect_flank_mutcount_noRM_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_ZDNAs_noRM, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/nonbdb_ZDNAs_ACcorrect_flank_mutcount_overlaps_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_ZDNAs_overlaps, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Load temporary output of the mutation counts
with open('./analysis/temp/nonbdb_ZDNAs_ACcorrect_flank_mutcount_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_ZDNAs = pickle.load(handle)
with open('./analysis/temp/nonbdb_ZDNAs_ACcorrect_flank_mutcount_noRM_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_ZDNAs_noRM = pickle.load(handle)
with open('./analysis/temp/nonbdb_ZDNAs_ACcorrect_flank_mutcount_overlaps_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_ZDNAs_overlaps = pickle.load(handle)

In [None]:
nonbdb_simple_freq_ZDNAs = dict()
for repeat in nonbdb_count_ZDNAs:
    nonbdb_simple_freq_ZDNAs[repeat] = nonbdb_count_ZDNAs[repeat][0][-np.inf].sum() / nonbdb_count_ZDNAs[repeat][1].sum()

nonbdb_simple_freq_ZDNAs_noRM = dict()
for repeat in nonbdb_count_ZDNAs_noRM:
    nonbdb_simple_freq_ZDNAs_noRM[repeat] = nonbdb_count_ZDNAs_noRM[repeat][0][-np.inf].sum() / nonbdb_count_ZDNAs_noRM[repeat][1].sum()

nonbdb_norm_ZDNAs_noRM = dict()
for repeat in nonbdb_count_ZDNAs_noRM:
    print('\r' + '                               ' + repeat, end='  ')
    nonbdb_norm_ZDNAs_noRM[repeat] = mut_norm_conf(nonbdb_count_ZDNAs_noRM[repeat], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True)

nonbdb_norm_ZDNAs_overlaps = dict()
for repeat in nonbdb_count_ZDNAs_overlaps:
    print('\r' + '                               ' + repeat, end='  ')
    nonbdb_norm_ZDNAs_overlaps[repeat] = mut_norm_conf(nonbdb_count_ZDNAs_overlaps[repeat], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True)

#### G4  <a name="mutation_surrounding_nonbdb_g4"></a>

[Return to Table of Contents](#TOC)

In [None]:
nonbdb_count_G4s = dict()
for repeat in nonbdb_G4s:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_G4s[repeat] = count_mut_flank(nonbdb_G4s[repeat].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = [], strand_col = 'Strand', useful_cols = ['Strand'])

nonbdb_count_G4s_noRM = dict()
for repeat in nonbdb_G4s:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_G4s_noRM[repeat] = count_mut_flank(nonbdb_G4s[repeat].loc[nonbdb_G4s[repeat]['RM_distance_min'] > 0].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = ['RM_distance'], strand_col = 'Strand', useful_cols = ['Strand'])

nonbdb_count_G4s_overlaps = dict()
for repeat in nonbdb_G4s:
    print('\r' + '                     ' + repeat, end='  ')
    nonbdb_count_G4s_overlaps[repeat] = count_mut_flank(nonbdb_G4s[repeat].loc[(nonbdb_G4s[repeat]['RM_distance_min'] > 0) & (nonbdb_G4s[repeat]['nonbdb_distance_min'] > 0)].copy(), qc_cutoff_list = [(-np.inf, -np.inf)], filter_cols = ['RM_distance', 'nonbdb_distance'], strand_col = 'Strand', useful_cols = ['Strand'])

In [None]:
# Save temporary output of the mutation counts
with open('./analysis/temp/nonbdb_G4s_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_G4s, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/nonbdb_G4s_flank_ACcorrect_mutcount_noRM_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_G4s_noRM, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('./analysis/temp/nonbdb_G4s_flank_ACcorrect_mutcount_overlaps_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_count_G4s_overlaps, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Load temporary output of the mutation counts
with open('./analysis/temp/nonbdb_G4s_flank_ACcorrect_mutcount_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_G4s = pickle.load(handle)
with open('./analysis/temp/nonbdb_G4s_flank_ACcorrect_mutcount_noRM_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_G4s_noRM = pickle.load(handle)
with open('./analysis/temp/nonbdb_G4s_flank_ACcorrect_mutcount_overlaps_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_count_G4s_overlaps = pickle.load(handle)

In [None]:
nonbdb_simple_freq_G4s = dict()
for repeat in nonbdb_count_G4s:
    nonbdb_simple_freq_G4s[repeat] = nonbdb_count_G4s[repeat][0][-np.inf].sum() / nonbdb_count_G4s[repeat][1].sum()

nonbdb_simple_freq_G4s_noRM = dict()
for repeat in nonbdb_count_G4s_noRM:
    nonbdb_simple_freq_G4s_noRM[repeat] = nonbdb_count_G4s_noRM[repeat][0][-np.inf].sum() / nonbdb_count_G4s_noRM[repeat][1].sum()

nonbdb_norm_G4s_noRM = dict()
for repeat in nonbdb_count_G4s_noRM:
    print('\r' + '                               ' + repeat, end='  ')
    nonbdb_norm_G4s_noRM[repeat] = mut_norm_conf(nonbdb_count_G4s_noRM[repeat], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True)

nonbdb_norm_G4s_overlaps = dict()
for repeat in nonbdb_count_G4s_overlaps:
    print('\r' + '                               ' + repeat, end='  ')
    nonbdb_norm_G4s_overlaps[repeat] = mut_norm_conf(nonbdb_count_G4s_overlaps[repeat], normtorandom = True, random_normaverage = normtorandom_all, do_binconf = True)

#### Save/load NonBDB mutation frequencies  <a name="mutation_surrounding_nonbdb_saveload"></a>

[Return to Table of Contents](#TOC)
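
The cells in this section gather the per-class dictionaries into one nested dictionary (summary type, then motif class, then category) and pickle it. A minimal sketch of that save/load round trip, using a temporary directory and made-up values; `flanking_demo` and `roundtrip` are hypothetical names that do not appear in the notebook.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the nested layout: summary type -> motif class -> category -> values.
flanking_demo = {
    "simple_freq": {"IR": {"long": [0.011, 0.012]}, "DR": {"long": [0.010, 0.013]}},
    "no_RM": {"IR": {"long": [0.010, 0.011]}, "DR": {"long": [0.009, 0.012]}},
}

def roundtrip(obj, path):
    """Pickle obj to path and read it back."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        return pickle.load(handle)

with tempfile.TemporaryDirectory() as tmpdir:
    restored = roundtrip(flanking_demo, os.path.join(tmpdir, "flank_demo.pickle"))

roundtrip_ok = restored == flanking_demo
print(roundtrip_ok)
```

Note that `dict.copy()` is a shallow copy: the combined dictionary shares its inner objects with the source dictionaries, whereas the object returned by `pickle.load` is fully independent.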

In [None]:
# Combine the NonB-DB flanking mutation frequencies for all motif classes and save them

nonbdb_flanking_all = dict()

# Simple frequencies from the unfiltered counts
nonbdb_flanking_all['simple_freq'] = dict()
nonbdb_flanking_all['simple_freq']['STR'] = nonbdb_simple_freq_long_STRs.copy()
nonbdb_flanking_all['simple_freq']['IR'] = nonbdb_simple_freq_IRs.copy()
nonbdb_flanking_all['simple_freq']['DR'] = nonbdb_simple_freq_DRs.copy()
nonbdb_flanking_all['simple_freq']['MR'] = nonbdb_simple_freq_MRs.copy()
nonbdb_flanking_all['simple_freq']['ZDNA'] = nonbdb_simple_freq_ZDNAs.copy()
nonbdb_flanking_all['simple_freq']['G4'] = nonbdb_simple_freq_G4s.copy()

# Simple frequencies from the *_noRM counts
nonbdb_flanking_all['no_RM'] = dict()
nonbdb_flanking_all['no_RM']['STR'] = nonbdb_simple_freq_long_STRs_noRM.copy()
nonbdb_flanking_all['no_RM']['IR'] = nonbdb_simple_freq_IRs_noRM.copy()
nonbdb_flanking_all['no_RM']['DR'] = nonbdb_simple_freq_DRs_noRM.copy()
nonbdb_flanking_all['no_RM']['MR'] = nonbdb_simple_freq_MRs_noRM.copy()
nonbdb_flanking_all['no_RM']['ZDNA'] = nonbdb_simple_freq_ZDNAs_noRM.copy()
nonbdb_flanking_all['no_RM']['G4'] = nonbdb_simple_freq_G4s_noRM.copy()

# Normalized frequencies (mut_norm_conf) from the *_noRM counts
nonbdb_flanking_all['tri_norm'] = dict()
nonbdb_flanking_all['tri_norm']['STR'] = nonbdb_norm_long_STRs_noRM.copy()
nonbdb_flanking_all['tri_norm']['IR'] = nonbdb_norm_IRs_noRM.copy()
nonbdb_flanking_all['tri_norm']['DR'] = nonbdb_norm_DRs_noRM.copy()
nonbdb_flanking_all['tri_norm']['MR'] = nonbdb_norm_MRs_noRM.copy()
nonbdb_flanking_all['tri_norm']['ZDNA'] = nonbdb_norm_ZDNAs_noRM.copy()
nonbdb_flanking_all['tri_norm']['G4'] = nonbdb_norm_G4s_noRM.copy()

# Normalized frequencies (mut_norm_conf) from the *_overlaps counts
nonbdb_flanking_all['no_overlaps'] = dict()
nonbdb_flanking_all['no_overlaps']['STR'] = nonbdb_norm_long_STRs_overlaps.copy()
nonbdb_flanking_all['no_overlaps']['IR'] = nonbdb_norm_IRs_overlaps.copy()
nonbdb_flanking_all['no_overlaps']['DR'] = nonbdb_norm_DRs_overlaps.copy()
nonbdb_flanking_all['no_overlaps']['MR'] = nonbdb_norm_MRs_overlaps.copy()
nonbdb_flanking_all['no_overlaps']['ZDNA'] = nonbdb_norm_ZDNAs_overlaps.copy()
nonbdb_flanking_all['no_overlaps']['G4'] = nonbdb_norm_G4s_overlaps.copy()

with open('./analysis/temp/nonbdb_flank_all_ACcorrect_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(nonbdb_flanking_all, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Load the combined NonB-DB flanking mutation frequencies
with open('./analysis/temp/nonbdb_flank_all_ACcorrect_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    nonbdb_flanking_all = pickle.load(handle)

### Count flanking mutations using subsampled gnomAD <a name="mutation_surrounding_subsample"></a>

[Return to Table of Contents](#TOC)
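
The cells below reuse pre-built downsampled inputs (`gnomad_slim_downsample`, `genome_AC_freq_all_downsample`) and scale the genome count with `round(gnomad_n_genomes * sample_frac)`. As a rough illustration of what downsampling allele counts can look like, the sketch below keeps each observed allele copy independently with probability `sample_frac`. This binomial thinning is an assumption made for illustration only, since the code that builds the downsampled inputs is not part of this section, and `n_genomes_demo` is a made-up number.

```python
import random

def thin_allele_count(allele_count, sample_frac, rng):
    """Keep each observed allele copy independently with probability sample_frac."""
    return sum(rng.random() < sample_frac for _ in range(allele_count))

rng = random.Random(0)          # fixed seed so the sketch is reproducible
n_genomes_demo = 70000          # hypothetical cohort size, not the notebook's value
allele_counts = [1, 3, 250, 12000]

for sample_frac in [0.1, 0.01, 0.004]:
    thinned = [thin_allele_count(ac, sample_frac, rng) for ac in allele_counts]
    n_sub = round(n_genomes_demo * sample_frac)
    print(sample_frac, n_sub, thinned)
```

At small sample fractions most rare variants drop to a count of zero, which is one reason the random-sequence baseline is recomputed separately for each fraction.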

In [None]:
# Count mutations surrounding random sequences
count_random_subsample = dict()
for sample_frac in [0.1, 0.01, 0.004]:
    print('\r' + '                                   sample_frac ' + str(sample_frac), end='  ')
    count_random_subsample[sample_frac] = count_mut_flank(random_seq, input_mut_dict = gnomad_slim_downsample[sample_frac], useful_cols = ['Strand'], gc_nmer = 51)

norm_random_subsample = dict()
for sample_frac in [0.1, 0.01, 0.004]:
    print('\r' + '                                   sample_frac ' + str(sample_frac), end='  ')
    norm_random_subsample[sample_frac] = mut_norm_conf(count_random_subsample[sample_frac], genome_AC_freq_current = genome_AC_freq_all_downsample[sample_frac], n_genomes = round(gnomad_n_genomes * sample_frac), gc_correct = True)

In [None]:
# Median SNV frequency at positions -50 to 50 around random sequences, per QC filter and sample fraction;
# used to normalize the subsampled mutation frequencies below
normtorandom_subsample = dict()
for sample_frac in [0.1, 0.01, 0.004]:
    normtorandom_subsample[sample_frac] = pd.Series([norm_random_subsample[sample_frac][0][qc_cutoff].loc[-50:50].median() for qc_cutoff in norm_random_subsample[sample_frac][0]], index = list(norm_random_subsample[sample_frac][0]))

In [None]:
# Count mutations surrounding STRs
count_long_STRs_subsample = dict()
for sample_frac in [0.1, 0.01, 0.004]:
    print('\r' + '                                   sample_frac ' + str(sample_frac), end='  ')
    count_long_STRs_subsample[sample_frac] = dict()
    for repeat in ['A', 'C', 'AC', 'AG']:
        print('\r' + '                     ' + repeat, end='  ')
        count_long_STRs_subsample[sample_frac][repeat] = count_mut_flank(long_STRs[repeat], input_mut_dict = gnomad_slim_downsample[sample_frac], useful_cols = ['Strand'], gc_nmer = 51)
    for repeat in repeats_highpower_sym:
        print('\r' + '                     ' + repeat, end='  ')
        count_long_STRs_subsample[sample_frac][repeat] = count_mut_flank(long_STRs[repeat], input_mut_dict = gnomad_slim_downsample[sample_frac], gc_nmer = 51)
    count_long_STRs_subsample[sample_frac]['GC_noCGI'] = count_mut_flank(STRs_CG, input_mut_dict = gnomad_slim_downsample[sample_frac], filter_cols = ['RM_distance', 'STR_distance', 'nonSTR_distance', 'within_motif_distance', 'CGI_distance'], gc_nmer = 51)

In [None]:
norm_long_STRs_subsample = dict()
for sample_frac in [0.1, 0.01, 0.004]:
    print('\r' + '                                   sample_frac ' + str(sample_frac), end='  ')
    norm_long_STRs_subsample[sample_frac] = dict()
    for repeat in count_long_STRs_subsample[sample_frac]:
        print('\r' + '                               ' + repeat, end='  ')
        norm_long_STRs_subsample[sample_frac][repeat] = mut_norm_conf(count_long_STRs_subsample[sample_frac][repeat], genome_AC_freq_current = genome_AC_freq_all_downsample[sample_frac], n_genomes = round(gnomad_n_genomes * sample_frac), normtorandom = True, random_normaverage = normtorandom_subsample[sample_frac], do_binconf = True, gc_correct = True)

In [None]:
# Collect the normalized subsampled mutation frequencies and save them
flanking_norm_subsample = dict()
for sample_frac in [0.1, 0.01, 0.004]:
    flanking_norm_subsample[sample_frac] = dict()
    flanking_norm_subsample[sample_frac]['STR'] = norm_long_STRs_subsample[sample_frac].copy()

with open('./analysis/temp/flank_norm_ACcorrect_downsample_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(flanking_norm_subsample, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Load the normalized subsampled mutation frequencies
with open('./analysis/temp/flank_norm_ACcorrect_downsample_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    flanking_norm_subsample = pickle.load(handle)

## Plot flanking mutation frequencies <a name="mutation_surrounding_plot"></a>

[Return to Table of Contents](#TOC)
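
The plotting helpers in this section draw a reference line at y = 1, and the frequencies computed earlier with `normtorandom = True` are scaled by a median taken over positions -50 to 50 around random sequences. A minimal sketch of that scaling, using plain dictionaries keyed by position; `normalize_to_random` and the toy numbers are hypothetical, and this is not the body of `mut_norm_conf`.

```python
from statistics import median

def normalize_to_random(profile, random_profile, window=50):
    """Divide a per-position frequency profile by the random-sequence median within +/- window."""
    baseline = median(v for pos, v in random_profile.items() if -window <= pos <= window)
    return {pos: v / baseline for pos, v in profile.items()}

# Toy per-position frequencies (position relative to the motif -> mutation frequency).
random_profile = {-100: 0.030, -50: 0.020, 0: 0.020, 50: 0.020, 100: 0.030}
motif_profile = {-1: 0.040, 0: 0.010, 1: 0.050}

normalized = normalize_to_random(motif_profile, random_profile)
print(normalized)
```

With this scaling, a value of 1 means the position mutates at the same rate as the random-sequence baseline, so values above or below the reference line read directly as fold enrichment or depletion.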

In [None]:
# Color scale for QC in plots
vqslod_list = [-np.inf, -2.774, 0, 4]
QC_colors = make_colorscale(vqslod_list)
QC_colors = pd.DataFrame(QC_colors).transpose()
QC_colors['name'] = ['no QC', 'pass', 'VQSLOD >0', 'VQSLOD >4']

QC_colors_denovo = pd.Series(['rgba(0, 0, 0, 0.5)', 'rgba(0, 0, 0, 0.15)', 'de novo'], index = [0,1, 'name'])

In [None]:
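# Functions to add the traces for one motif category to a subplot.
# For each QC filter, element [0] of input_dict[motif][category] is drawn as the main line and
# elements [1] and [2] as the edges of the shaded band around it.
# Note: the default input_dict is bound when this cell runs, so re-run the cell after reloading flanking_norm_all.
def plot_flank_add(motif, category, display_name, plot_name, row_n, col_n, input_dict = flanking_norm_all, showleg = False):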
    for QCfilter in vqslod_list:
        plot_name.add_trace(go.Scatter(name = QC_colors['name'][QCfilter], legendgroup = QCfilter, showlegend=False, fill=None, x = input_dict[motif][category][2][QCfilter].dropna().index, y = input_dict[motif][category][2][QCfilter].dropna(), mode = 'lines', line = dict(width = 0, color = QC_colors[0][QCfilter])), row=row_n, col=col_n)
        plot_name.add_trace(go.Scatter(name = QC_colors['name'][QCfilter], legendgroup = QCfilter, showlegend = showleg, fill='tonexty', fillcolor = QC_colors[1][QCfilter], x = input_dict[motif][category][0][QCfilter].index, y = input_dict[motif][category][0][QCfilter], mode = 'lines', line = dict(width = 2, color = QC_colors[0][QCfilter])), row=row_n, col=col_n)
        plot_name.add_trace(go.Scatter(name = QC_colors['name'][QCfilter], legendgroup = QCfilter, showlegend=False, fill='tonexty', fillcolor = QC_colors[1][QCfilter], x = input_dict[motif][category][1][QCfilter].dropna().index, y = input_dict[motif][category][1][QCfilter].dropna(), mode = 'lines', line = dict(width = 0, color = QC_colors[0][QCfilter])), row=row_n, col=col_n)
    plot_name.add_shape(type='line', x0=-500, y0=1, x1=500, y1=1, line=dict(color='Black', width = .3), row=row_n, col=col_n)
    plot_name.update_yaxes(zeroline = False, title = dict(text = display_name, font = dict(size = 18 if len(display_name) < 10 else round(18* (10/len(display_name))))), title_standoff = 0, row = row_n, col = col_n)

def plot_flank_add_denovo(motif, category, display_name, plot_name, row_n, col_n, input_dict = flanking_norm_denovo, showleg = False):
    plot_name.add_trace(go.Scatter(name = QC_colors_denovo['name'], legendgroup = 'denovo', showlegend=False, fill=None, x = input_dict[motif][category][2]['denovo'].dropna().index, y = input_dict[motif][category][2]['denovo'].dropna(), mode = 'lines', line = dict(width = 0, color = QC_colors_denovo[0])), row=row_n, col=col_n)
    plot_name.add_trace(go.Scatter(name = QC_colors_denovo['name'], legendgroup = 'denovo', showlegend = showleg, fill='tonexty', fillcolor = QC_colors_denovo[1], x = input_dict[motif][category][0]['denovo'].index, y = input_dict[motif][category][0]['denovo'], mode = 'lines', line = dict(width = 2, color = QC_colors_denovo[0])), row=row_n, col=col_n)
    plot_name.add_trace(go.Scatter(name = QC_colors_denovo['name'], legendgroup = 'denovo', showlegend=False, fill='tonexty', fillcolor = QC_colors_denovo[1], x = input_dict[motif][category][1]['denovo'].dropna().index, y = input_dict[motif][category][1]['denovo'].dropna(), mode = 'lines', line = dict(width = 0, color = QC_colors_denovo[0])), row=row_n, col=col_n)
    plot_name.add_shape(type='line', x0=-500, y0=1, x1=500, y1=1, line=dict(color='Black', width = .3), row=row_n, col=col_n)
    plot_name.update_yaxes(zeroline = False, title = dict(text = display_name, font = dict(size = 18 if len(display_name) < 10 else round(18* (10/len(display_name))))), title_standoff = 0, row = row_n, col = col_n)

    
# Function to generate individual plots
def plot_flank(motif, category, input_dict = flanking_norm_all, denovo = False):
    mutnorm_all_binconf_fig = make_subplots()
    plot_flank_add(motif = motif, category = category, display_name = motif + ' ' + category, plot_name = mutnorm_all_binconf_fig, row_n = 1, col_n = 1, input_dict = input_dict, showleg = True)
    if denovo:
        plot_flank_add_denovo(motif = motif, category = category, display_name = motif + ' ' + category, plot_name = mutnorm_all_binconf_fig, row_n = 1, col_n = 1, showleg = True)
    mutnorm_all_binconf_fig.update_xaxes(range = [-250,250])
    mutnorm_all_binconf_fig.update_yaxes(range = [0.1,3.1], zeroline = False)
    return mutnorm_all_binconf_fig

#### Plot for Figure 2 <a name="mutation_surrounding_plot_fig2"></a>

[Return to Table of Contents](#TOC)
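
The y-axis titles in the figure panels are sized by the rule used in `plot_flank_add`: 18 pt for labels shorter than 10 characters, otherwise shrunk in proportion to the label length. The sketch below isolates that rule; the wrapper name `title_font_size` and its parameters are hypothetical.

```python
def title_font_size(display_name, base_size=18, max_chars=10):
    """Shrink the axis-title font proportionally once the label reaches max_chars characters."""
    if len(display_name) < max_chars:
        return base_size
    return round(base_size * (max_chars / len(display_name)))

for label in ["A-mono", "Direct", "GC (outside CGIs)", "MRs (non-homopurine)"]:
    print(label, title_font_size(label))
```

Longer display names such as `GC (outside CGIs)` therefore still fit beside a narrow subplot without changing the layout margins.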

In [None]:
mutnorm_all_binconf_fig_combined_fig2 = make_subplots(rows = 2, cols = 1, shared_xaxes = True, vertical_spacing=0.02, horizontal_spacing = 0.11)
counter = 0
for motif, category, name in zip(['STR'], ['A'], ['A-mono']):
    counter +=1
    plot_flank_add(motif = motif, category = category, display_name = name, plot_name = mutnorm_all_binconf_fig_combined_fig2, row_n = counter, col_n = 1, showleg = True)

for motif, category, name in zip(['DR'], ['long_DRs'], ['Direct']):
    counter +=1
    plot_flank_add(motif = motif, category = category, display_name = name, plot_name = mutnorm_all_binconf_fig_combined_fig2, row_n = counter, col_n = 1)
    mutnorm_all_binconf_fig_combined_fig2.add_trace(go.Scatter(showlegend = True, x = nonbdb_flanking_all['tri_norm'][motif]['long'][0][-np.inf].index, y = nonbdb_flanking_all['tri_norm'][motif]['long'][0][-np.inf], marker = dict(color = 'rgba(80,80,80,0.4)'), name = 'uncorrected', connectgaps = True, legendgroup = 'bad'), row = counter, col = 1)
        
mutnorm_all_binconf_fig_combined_fig2.update_yaxes(range = [0.25,3.1], row = 1, col = 1)
mutnorm_all_binconf_fig_combined_fig2.update_yaxes(range = [0.25,3.1], row = 2, col = 1)
mutnorm_all_binconf_fig_combined_fig2.update_xaxes(range = [-250,250], tickmode = 'array', tickvals = list(range(-250,300,50)))
mutnorm_all_binconf_fig_combined_fig2.update_layout(height = 400, width = 650, margin = dict(l = 55, r = 20, b = 45, t = 40), legend=dict(y = 1.10, x = -0.05, orientation='h', itemwidth = 30))
mutnorm_all_binconf_fig_combined_fig2.show()

In [None]:
mutnorm_all_binconf_fig_combined_fig2.write_image('./plots/revision_ACcor_flanking_mutation_fig_2.png', format='png', scale = 10, engine = 'orca')

#### Combination plot for Supplementary Figure S2A <a name="mutation_surrounding_plot_figS2A"></a>
- includes NonB-DB unfiltered mutation frequencies 

[Return to Table of Contents](#TOC)

In [None]:
# Generate Supplementary Figure S2A

mutnorm_all_binconf_fig_combined_figS2a = make_subplots(rows = 5, cols = 3, shared_xaxes = True, vertical_spacing=0.02, horizontal_spacing = 0.11, column_widths=[5/11, 3/11, 3/11])

counter = 0
motif = 'STR'
for category, name in zip(['A', 'C', 'AT', 'AC', 'AG'], ['A', 'C', 'AT', 'AC', 'AG']):
    counter +=1
    plot_flank_add(motif = motif, category = category, display_name = name, plot_name = mutnorm_all_binconf_fig_combined_figS2a, row_n = counter, col_n = 1, showleg = (counter == 1))

counter = 0
for category, name in zip(['ACC', 'AAC', 'AAG', 'AAT', 'GC'], ['CAC', 'CAA', 'GAA', 'TAA', 'GC']):
    counter +=1
    plot_flank_add(motif = motif, category = category, display_name = name, plot_name = mutnorm_all_binconf_fig_combined_figS2a, row_n = counter, col_n = 2)
    
counter = 0
for category, name in zip(['AGC', 'ATC', 'AGG', 'AAAT', 'GC_noCGI'], ['CAG', 'CAT', 'GGA', 'TAAA', 'GC (outside CGIs)']):
    counter +=1
    plot_flank_add(motif = motif, category = category, display_name = name, plot_name = mutnorm_all_binconf_fig_combined_figS2a, row_n = counter, col_n = 3)

mutnorm_all_binconf_fig_combined_figS2a.update_yaxes(range = [0.25, 3.5])
mutnorm_all_binconf_fig_combined_figS2a.update_xaxes(range = [-255,255], tickmode = 'array', tickvals = list(range(-250,300,50)), col = 1)
mutnorm_all_binconf_fig_combined_figS2a.update_xaxes(range = [-155,155], tickmode = 'array', tickvals = list(range(-150,200,50)), col = 2)
mutnorm_all_binconf_fig_combined_figS2a.update_xaxes(range = [-155,155], tickmode = 'array', tickvals = list(range(-150,200,50)), col = 3)
mutnorm_all_binconf_fig_combined_figS2a.update_layout(height = 650, width = 650, margin = dict(l = 55, r = 20, b = 45, t = 30), legend=dict(y = 1.05, x = -0.1, orientation='h', itemwidth = 30))
mutnorm_all_binconf_fig_combined_figS2a.show()

In [None]:
mutnorm_all_binconf_fig_combined_figS2a.write_image('./plots/revision_ACcor_flanking_mutation_fig_S2a.png', format='png', scale = 10, engine = 'orca')

#### Combination plot for Supplementary Figure S2B <a name="mutation_surrounding_plot_figS2B"></a>

[Return to Table of Contents](#TOC)

In [None]:
# Generate Supplementary Figure S2B

mutnorm_all_binconf_fig_combined_figS2b = make_subplots(rows = 5, cols = 2, shared_xaxes = True, vertical_spacing=0.02, horizontal_spacing = 0.11)

counter = 0
motif = 'IR'
for category, name in zip(['long_IRs', 'very_long_IRs', 'long_IRs_loop10', 'AT80_IRs', 'AT40_IRs'], ['IRs (len>80%)', 'IRs (len>95%)', 'IRs (loop<11nt)', 'IRs (GC<20%)', 'IRs (GC>60%)']):
    counter +=1
    plot_flank_add(motif = motif, category = category, display_name = name, plot_name = mutnorm_all_binconf_fig_combined_figS2b, row_n = counter, col_n = 1, showleg = (counter == 1))
    mutnorm_all_binconf_fig_combined_figS2b.update_yaxes(range = [0.1,2.6], row = counter, col = 1)
    
counter = 0
motif = 'MR'
for category, name in zip(['long_MRs', 'very_long_MRs', 'long_MRs_loop10', 'homopurine_MRs', 'non_homopurine_MRs'], ['MRs (len>80%)', 'MRs (len>95%)', 'MRs (loop<11nt)', 'MRs (homopurine)', 'MRs (non-homopurine)']):
    counter +=1
    plot_flank_add(motif = motif, category = category, display_name = name, plot_name = mutnorm_all_binconf_fig_combined_figS2b, row_n = counter, col_n = 2)
#    mutnorm_all_binconf_fig_combined_figS2b.add_trace(go.Scatter(showlegend = True if counter ==1 else False, x = nonbdb_flanking_all['simple_freq'][motif][nonb_category].index, y = nonbdb_flanking_all['simple_freq'][motif][nonb_category], marker = dict(color = 'rgba(80,80,80,0.2)'), name = 'simple', connectgaps = True, legendgroup = 'bad'), row = counter, col = 2)
#    mutnorm_all_binconf_fig_combined_figS2b.add_trace(go.Scatter(showlegend = True if counter ==1 else False, x = nonbdb_flanking_all['tri_norm'][motif][nonb_category][0][-np.inf].index, y = nonbdb_flanking_all['tri_norm'][motif][nonb_category][0][-np.inf], marker = dict(color = 'rgba(50,50,50,0.4)'), name = 'uncorrected', connectgaps = True, legendgroup = 'bad'), row = counter, col = 2)
    mutnorm_all_binconf_fig_combined_figS2b.update_yaxes(range = [0.1,2.6], row = counter, col = 2)
    
mutnorm_all_binconf_fig_combined_figS2b.update_xaxes(range = [-255,255], tickmode = 'array', tickvals = list(range(-250,300,50)))
mutnorm_all_binconf_fig_combined_figS2b.update_layout(height = 650, width = 650, margin = dict(l = 55, r = 20, b = 45, t = 30), legend=dict(y = 1.05, x = -0.1, orientation='h', itemwidth = 30))
mutnorm_all_binconf_fig_combined_figS2b.show()

In [None]:
mutnorm_all_binconf_fig_combined_figS2b.write_image('./plots/revision_ACcor_flanking_mutation_fig_S2b.png', format='png', scale = 10, engine = 'orca')

#### Plot for Supplementary Figure S2C <a name="mutation_surrounding_plot_figS2C"></a>

[Return to Table of Contents](#TOC)

In [None]:
# Generate Supplementary Figure S2C

mutnorm_all_binconf_fig_combined_figS2c = make_subplots(rows = 5, cols = 2, shared_xaxes = True, vertical_spacing=0.02, horizontal_spacing = 0.11)

counter = 0
for motif, category, name in zip(['G4', 'G4', 'ZDNA', 'ZDNA', 'ZDNA'], ['K_G4s', 'PDS_G4s', 'all_ZDNAs', 'long_ZDNAs', 'ZDNAs_GY'], ['K+ G4', 'PDS G4', 'Z-DNA', 'Z-DNA (len>80%)', 'Z-DNA (GY motif)']):
    counter +=1
    plot_flank_add(motif = motif, category = category, display_name = name, plot_name = mutnorm_all_binconf_fig_combined_figS2c, row_n = counter, col_n = 1, showleg = True if counter == 1 else False)
    mutnorm_all_binconf_fig_combined_figS2c.update_yaxes(range = [0.1,2.6], row = counter, col = 1)
    
counter = 0
for motif, category, name in zip(['DR', 'DR', 'DR', 'DR', 'random'], ['long_DRs', 'very_long_DRs', 'perfect_long_DRs', 'long_DRs_loop10', 'random'], ['DRs (len>80%)', 'DRs (len>95%)', 'DRs (no int.)', 'DRs (loop<11nt)', 'random']):
    counter +=1
    plot_flank_add(motif = motif, category = category, display_name = name, plot_name = mutnorm_all_binconf_fig_combined_figS2c, row_n = counter, col_n = 2)
#    mutnorm_all_binconf_fig_combined_figS2c.add_trace(go.Scatter(showlegend = True if counter ==1 else False, x = nonbdb_flanking_all['simple_freq'][motif][nonb_category].index, y = nonbdb_flanking_all['simple_freq'][motif][nonb_category], marker = dict(color = 'rgba(80,80,80,0.2)'), name = 'simple', connectgaps = True, legendgroup = 'bad'), row = counter, col = 2)
#    mutnorm_all_binconf_fig_combined_figS2c.add_trace(go.Scatter(showlegend = True if counter ==1 else False, x = nonbdb_flanking_all['tri_norm'][motif][nonb_category][0][-np.inf].index, y = nonbdb_flanking_all['tri_norm'][motif][nonb_category][0][-np.inf], marker = dict(color = 'rgba(50,50,50,0.4)'), name = 'uncorrected', connectgaps = True, legendgroup = 'bad'), row = counter, col = 2)
    mutnorm_all_binconf_fig_combined_figS2c.update_yaxes(range = [0.1,2.6], row = counter, col = 2)
    
mutnorm_all_binconf_fig_combined_figS2c.update_xaxes(range = [-255,255], tickmode = 'array', tickvals = list(range(-250,300,50)))
mutnorm_all_binconf_fig_combined_figS2c.update_layout(height = 650, width = 650, margin = dict(l = 55, r = 20, b = 45, t = 30), legend=dict(y = 1.05, x = -0.1, orientation='h', itemwidth = 30))
mutnorm_all_binconf_fig_combined_figS2c.show()

In [None]:
mutnorm_all_binconf_fig_combined_figS2c.write_image('./plots/revision_ACcor_flanking_mutation_fig_S2c.png', format='png', scale = 10, engine = 'orca')

#### Plot Supplementary Figure S2D <a name="mutation_surrounding_CGI_plot"></a>

[Return to Table of Contents](#TOC)

In [None]:
# Color scale for nonBdb plots
nonbdb_colors = make_default_colors(['simple freq', 'repeat-masked', 'tri-normalized', 'red', 'no overlaps', 'fully-corrected'], opacity=0.5)

In [None]:
# Function to generate individual plot traces
def nonb_plot_flank_add(motif, category, correct_cat, display_name, plot_name, row_n, col_n, showleg = False):
    plot_name.add_trace(go.Scatter(name = 'uncorr.', legendgroup = 'simple_freq', showlegend = showleg, x = nonbdb_flanking_all['simple_freq'][motif][category].index, y = nonbdb_flanking_all['simple_freq'][motif][category] / normtorandom_simplefreq, mode = 'lines', line = dict(width = 2, color = nonbdb_colors[1]['simple freq'])), row=row_n, col=col_n)
    plot_name.add_trace(go.Scatter(name = 'Repeatmasker', legendgroup = 'no_RM', showlegend = showleg, x = nonbdb_flanking_all['no_RM'][motif][category].index, y = nonbdb_flanking_all['no_RM'][motif][category] / normtorandom_simplefreq, mode = 'lines', line = dict(width = 2, color = nonbdb_colors[1]['repeat-masked'])), row=row_n, col=col_n)
    plot_name.add_trace(go.Scatter(name = 'tri-normalized', legendgroup = 'tri_norm', showlegend = showleg, x = nonbdb_flanking_all['tri_norm'][motif][category][0][-np.inf].index, y = nonbdb_flanking_all['tri_norm'][motif][category][0][-np.inf], mode = 'lines', line = dict(width = 2, color = nonbdb_colors[1]['tri-normalized'])), row=row_n, col=col_n)
    plot_name.add_trace(go.Scatter(name = 'unique', legendgroup = 'no_overlaps', showlegend = showleg, x = nonbdb_flanking_all['no_overlaps'][motif][category][0][-np.inf].index, y = nonbdb_flanking_all['no_overlaps'][motif][category][0][-np.inf], mode = 'lines', line = dict(width = 2, color = nonbdb_colors[1]['no overlaps'])), row=row_n, col=col_n)
    plot_name.add_trace(go.Scatter(name = 'full corr.', legendgroup = 'corrected', showlegend = showleg, x = flanking_norm_all[motif][correct_cat][0][-2.774].index, y = flanking_norm_all[motif][correct_cat][0][-2.774], mode = 'lines', line = dict(width = 2, color = nonbdb_colors[1]['fully-corrected'])), row=row_n, col=col_n)

    plot_name.add_shape(type='line', x0=-500, y0=1, x1=500, y1=1, line=dict(color='Black', width = .3), row=row_n, col=col_n)
    plot_name.update_yaxes(zeroline = False, title = dict(text = display_name, font = dict(size = 18 if len(display_name) < 10 else round(18* (10/len(display_name))))), title_standoff = 0, row = row_n, col = col_n)

# Function to generate individual plots
def nonb_plot_flank(motif, category, correct_cat):
    mutnorm_all_binconf_fig = make_subplots()
    nonb_plot_flank_add(motif = motif, category = category, correct_cat = correct_cat, display_name = motif + ' ' + category, plot_name = mutnorm_all_binconf_fig, row_n = 1, col_n = 1, showleg = True)
    mutnorm_all_binconf_fig.update_xaxes(range = [-250,250])
    mutnorm_all_binconf_fig.update_yaxes(range = [0.1,3.1], zeroline = False)
    return mutnorm_all_binconf_fig

In [None]:
# Generate Supplementary Figure S2D

mutnorm_all_fig_combined_figS2d = make_subplots(rows = 5, cols = 2, shared_xaxes = True, vertical_spacing=0.02, horizontal_spacing = 0.11)

counter = 0
motif = 'STR'
for category in ['A', 'C', 'AT', 'AC', 'AG']:
    counter +=1
    nonb_plot_flank_add(motif = motif, correct_cat = category, category=category, display_name = category, plot_name = mutnorm_all_fig_combined_figS2d, row_n = counter, col_n = 1, showleg = True if counter == 1 else False)
    
counter = 0
for motif, category, nonb_category, name in zip(['IR', 'DR', 'MR', 'ZDNA', 'G4'], ['long_IRs', 'long_DRs', 'long_MRs', 'all_ZDNAs', 'K_G4s'], ['long', 'long', 'long', 'all', 'all'], ['Inverted', 'Direct', 'Mirror', 'Z-DNA', 'G4']):
    counter +=1
    nonb_plot_flank_add(motif = motif, correct_cat = category, category=nonb_category, display_name = name, plot_name = mutnorm_all_fig_combined_figS2d, row_n = counter, col_n = 2)

mutnorm_all_fig_combined_figS2d.update_yaxes(range = [0.1,4.5])
mutnorm_all_fig_combined_figS2d.update_yaxes(range = [0.1,7.9], row = 1, col = 1)
mutnorm_all_fig_combined_figS2d.update_yaxes(range = [0.1,10.9], row = 2, col = 1)
mutnorm_all_fig_combined_figS2d.update_yaxes(range = [0.49,1.75], row = 1, col = 2)

    
mutnorm_all_fig_combined_figS2d.update_xaxes(range = [-255,255], tickmode = 'array', tickvals = list(range(-250,300,50)))
mutnorm_all_fig_combined_figS2d.update_layout(height = 650, width = 650, margin = dict(l = 55, r = 20, b = 45, t = 30), legend=dict(y = 1.05, x = -0.1, orientation='h', itemwidth = 30))
mutnorm_all_fig_combined_figS2d.show()

In [None]:
mutnorm_all_fig_combined_figS2d.write_image('./plots/revision_ACcor_flanking_mutation_fig_S2d.png', format='png', scale = 10, engine = 'orca')

## Mutation frequency with gnomAD subsampling <a name="mutation_surrounding_subsample"></a>

[Return to Table of Contents](#TOC)

#### Plot Supplementary Figure S2E <a name="mutation_surrounding_subsample_plot"></a>

[Return to Table of Contents](#TOC)

In [None]:
mutnorm_perfect_downsample_fig_combined_figS2e = make_subplots(rows = 5, cols = 4, vertical_spacing = 0.025, horizontal_spacing = 0.025, shared_yaxes = True, shared_xaxes = True)

counter = 0
motif = 'STR'
for category in ['A', 'C', 'AT', 'AC', 'AG']:
    counter +=1
    plot_flank_add(motif = motif, category = category, input_dict = flanking_norm_all, display_name = category, plot_name = mutnorm_perfect_downsample_fig_combined_figS2e, row_n = counter, col_n = 1, showleg = True if counter == 1 else False)

counter = 0
for category in ['A', 'C', 'AT', 'AC', 'AG']:
    counter +=1
    plot_flank_add(motif = motif, category = category, input_dict = flanking_norm_subsample[0.1], display_name = category, plot_name = mutnorm_perfect_downsample_fig_combined_figS2e, row_n = counter, col_n = 2)

counter = 0
for category in ['A', 'C', 'AT', 'AC', 'AG']:
    counter +=1
    plot_flank_add(motif = motif, category = category, input_dict = flanking_norm_subsample[0.01], display_name = category, plot_name = mutnorm_perfect_downsample_fig_combined_figS2e, row_n = counter, col_n = 3)
    
counter = 0
for category in ['A', 'C', 'AT', 'AC', 'AG']:
    counter +=1
    plot_flank_add(motif = motif, category = category, input_dict = flanking_norm_subsample[0.004], display_name = category, plot_name = mutnorm_perfect_downsample_fig_combined_figS2e, row_n = counter, col_n = 4)

mutnorm_perfect_downsample_fig_combined_figS2e.update_yaxes(range = [0.25,4.25])

mutnorm_perfect_downsample_fig_combined_figS2e.update_xaxes(title = dict(text = 'no subsample', standoff = 0, font = dict(size = 16)), row = 5, col = 1)
mutnorm_perfect_downsample_fig_combined_figS2e.update_xaxes(title = dict(text = 'subsample 1/10', standoff = 0, font = dict(size = 16)), row = 5, col = 2)
mutnorm_perfect_downsample_fig_combined_figS2e.update_xaxes(title = dict(text = 'subsample 1/100', standoff = 0, font = dict(size = 16)), row = 5, col = 3)
mutnorm_perfect_downsample_fig_combined_figS2e.update_xaxes(title = dict(text = 'subsample 1/250', standoff = 0, font = dict(size = 16)), row = 5, col = 4)
                
for cols in range(1,5):
    for rows in range(1,6):
        mutnorm_perfect_downsample_fig_combined_figS2e.add_shape(type='line', x0=-100, y0=1, x1=250, y1=1, line=dict(color='Black', width = .3), row = rows, col = cols)
mutnorm_perfect_downsample_fig_combined_figS2e.update_xaxes(range = [-100,100], tickmode = 'array', tickvals = [-100, -50, 0, 50, 100])
mutnorm_perfect_downsample_fig_combined_figS2e.update_layout(width = 800, height = 500, margin = dict(l = 35, r = 25, b = 0, t = 20), legend=dict(y = -0.085, x = 0.2, orientation='h'))
mutnorm_perfect_downsample_fig_combined_figS2e.show()

In [None]:
mutnorm_perfect_downsample_fig_combined_figS2e.write_image('./plots/revision_ACcor_flanking_mutation_fig_S2E.png', format='png', scale = 10, engine = 'orca')

### Plot de novo flanking mutagenesis (Supplementary Figure S2F)

In [None]:
mutnorm_denovo_figS2f = make_subplots(rows = 5, cols = 1, vertical_spacing = 0.025, horizontal_spacing = 0.025, shared_yaxes = True, shared_xaxes = True)

counter = 0
motif = 'STR'
for category in ['A', 'AT', 'AC', 'AG']:
    counter +=1
    plot_flank_add(motif = motif, category = category, input_dict = flanking_norm_all, display_name = category, plot_name = mutnorm_denovo_figS2f, row_n = counter, col_n = 1, showleg = True if counter == 1 else False)
    plot_flank_add_denovo(motif = motif, category = category, input_dict = flanking_norm_denovo, display_name = category, plot_name = mutnorm_denovo_figS2f, row_n = counter, col_n = 1, showleg = True if counter == 1 else False)
counter +=1
plot_flank_add(motif = 'G4', category = 'K_G4s', input_dict = flanking_norm_all, display_name = 'K+ G4', plot_name = mutnorm_denovo_figS2f, row_n = counter, col_n = 1, showleg = True if counter == 1 else False)
plot_flank_add_denovo(motif = 'G4', category = 'K_G4s', input_dict = flanking_norm_denovo, display_name = 'K+ G4', plot_name = mutnorm_denovo_figS2f, row_n = counter, col_n = 1, showleg = True if counter == 1 else False)

mutnorm_denovo_figS2f.update_yaxes(range = [0,4.25])

mutnorm_denovo_figS2f.update_xaxes(range = [-100,100], tickmode = 'array', tickvals = [-100, -50, 0, 50, 100])
mutnorm_denovo_figS2f.update_layout(width = 800, height = 600, margin = dict(l = 35, r = 25, b = 20, t = 20), legend=dict(y = -0.03, x = 0.2, orientation='h'))
mutnorm_denovo_figS2f.show()

In [None]:
mutnorm_denovo_figS2f.write_image('./plots/revision_flanking_mutation_denovo_fig_S2F.png', format='png', scale = 10, engine = 'orca')

## Mutation frequency per-mutation-type surrounding STRs  <a name="mutation_surrounding_bymut"></a>

[Return to Table of Contents](#TOC)
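
The cells in this section group trinucleotide mutations into six classes by their central base change (C>A, C>G, C>T, T>A, T>C, T>G) and pair each class with its reverse complement. The following is a minimal, self-contained sketch of that grouping, assuming mutation strings of the form `NNN_N` (reference trinucleotide, underscore, alternate base). The helpers `all_tri_mutations` and `revcomp_mut` are hypothetical stand-ins written for illustration; they are not the notebook's own `triplet_mutations_und_TC` or `reverse_complement_mut`, which are defined elsewhere.

```python
from itertools import product

BASES = 'ACGT'
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def all_tri_mutations():
    """List the 96 trinucleotide mutations whose central reference base is C or T."""
    muts = []
    for left, mid, right in product(BASES, 'CT', BASES):
        for alt in BASES:
            if alt != mid:
                muts.append(left + mid + right + '_' + alt)
    return muts

def revcomp_mut(mut):
    """Reverse complement both sides of a 'REF_ALT' mutation string (hypothetical helper)."""
    ref, alt = mut.split('_')
    return ref.translate(COMPLEMENT)[::-1] + '_' + alt.translate(COMPLEMENT)[::-1]

muts = all_tri_mutations()

# Group by central reference base (index 1) and alternate base (index 4),
# mirroring the mut[1] / mut[4] tests used in the cell below.
classes = {}
for mid, alt in [('C', 'A'), ('C', 'G'), ('C', 'T'), ('T', 'A'), ('T', 'C'), ('T', 'G')]:
    forward = [m for m in muts if m[1] == mid and m[4] == alt]
    classes[mid + '_' + alt] = {'ind_F': forward,
                                'ind_RC': [revcomp_mut(m) for m in forward]}
```

Each class holds 16 forward trinucleotide contexts and the 16 matching reverse-complement contexts, which is why forward and reverse-complement frequencies can be tracked as separate curves.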

In [None]:
# Assign the trinucleotide mutations belonging to each mutation type (forward and reverse complement), alongside the plot colors
colors['ind_F'] = [mut for mut in triplet_mutations_und_TC if (mut[1] == 'C') & (mut[4] == 'A')], [mut for mut in triplet_mutations_und_TC if (mut[1] == 'C') & (mut[4] == 'G')], [mut for mut in triplet_mutations_und_TC if (mut[1] == 'C') & (mut[4] == 'T')], [mut for mut in triplet_mutations_und_TC if (mut[1] == 'T') & (mut[4] == 'A')], [mut for mut in triplet_mutations_und_TC if (mut[1] == 'T') & (mut[4] == 'C')], [mut for mut in triplet_mutations_und_TC if (mut[1] == 'T') & (mut[4] == 'G')]
colors['ind_RC'] = [[reverse_complement_mut(mut) for mut in triplets] for triplets in colors['ind_F']]

In [None]:
# Calculate normalized frequencies per mutation type for random sequences
norm_random_bymut = dict()
for mut_type in colors.index:
    norm_random_bymut[mut_type] = mut_norm_conf(count_random_all, tri_subset = colors['ind_F'][mut_type], do_binconf = False, gc_correct = True)
    norm_random_bymut[reverse_complement_mut(mut_type)] = mut_norm_conf(count_random_all, tri_subset = colors['ind_RC'][mut_type], do_binconf = False, gc_correct = True)

In [None]:
# Median SNV frequencies for each QC filter, used to normalize all later mutation frequencies
normtorandom_bymut = dict()
for mut_type in list(norm_random_bymut):
    normtorandom_bymut[mut_type] = pd.Series([norm_random_bymut[mut_type][0][qc_cutoff].loc[-50:50].median() for qc_cutoff in norm_random_bymut[mut_type][0]], index = list(norm_random_bymut[mut_type][0]))

In [None]:
# Calculate normalized frequencies per mutation type for STRs
norm_long_STRs_bymut = dict()
for repeat in ['A', 'C', 'AT', 'AC', 'AG']:
    norm_long_STRs_bymut[repeat] = dict()
    print('\r' + '                               ' + repeat, end='  ')
    for mut_type in colors.index:
        norm_long_STRs_bymut[repeat][mut_type] = mut_norm_conf(count_long_STRs[repeat], normtorandom = True, random_normaverage = normtorandom_bymut[mut_type], tri_subset = colors['ind_F'][mut_type], do_binconf = False, gc_correct = True)
        norm_long_STRs_bymut[repeat][reverse_complement_mut(mut_type)] = mut_norm_conf(count_long_STRs[repeat], normtorandom = True, random_normaverage = normtorandom_bymut[mut_type], tri_subset = colors['ind_RC'][mut_type], do_binconf = False, gc_correct = True)

In [None]:
# Save temporary output of the normalized per-mutation-type frequencies
flanking_norm_bymut = dict()
flanking_norm_bymut['STR'] = norm_long_STRs_bymut.copy()
flanking_norm_bymut['random'] = norm_random_bymut.copy()

with open('./analysis/temp/flank_norm_ACcor_bymut_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(flanking_norm_bymut, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Load temporary output of the normalized per-mutation-type frequencies
with open('./analysis/temp/flank_norm_ACcor_bymut_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    flanking_norm_bymut = pickle.load(handle)

# Median SNV frequencies for each QC filter, used to normalize all later mutation frequencies
normtorandom_bymut = dict()
for mut_type in list(flanking_norm_bymut['random']):
    normtorandom_bymut[mut_type] = pd.Series([flanking_norm_bymut['random'][mut_type][0][qc_cutoff].loc[-50:50].median() for qc_cutoff in flanking_norm_bymut['random'][mut_type][0]], index = list(flanking_norm_bymut['random'][mut_type][0]))

#### Plot Supplementary Figure S4 <a name="mutation_surrounding_bymut_plot"></a>

[Return to Table of Contents](#TOC)

In [None]:
# Function to generate individual plot traces
def plot_flank_add_bymut(motif, category, display_name, plot_name, row_n, col_n, input_dict, QCfilter, showleg = False):
    for mut_type in colors.index:
        plot_name.add_trace(go.Scatter(name = mut_type.replace('_', '>'), legendgroup = mut_type, showlegend = showleg, x = input_dict[motif][category][mut_type][0][QCfilter].index, y = input_dict[motif][category][mut_type][0][QCfilter], mode = 'lines', line = dict(width = 2, color = colors['color'][mut_type]), connectgaps=True), row=row_n, col=col_n)
        plot_name.add_trace(go.Scatter(name = reverse_complement_mut(mut_type).replace('_', '>'), legendgroup = mut_type, showlegend = showleg, x = input_dict[motif][category][reverse_complement_mut(mut_type)][0][QCfilter].index, y = input_dict[motif][category][reverse_complement_mut(mut_type)][0][QCfilter], mode = 'lines', line = dict(width = 2, dash = 'dot', color = colors['color'][mut_type]), connectgaps=True), row=row_n, col=col_n)
    plot_name.add_shape(type='line', x0=-500, y0=1, x1=500, y1=1, line=dict(color='Black', width = .3), row=row_n, col=col_n)
    plot_name.update_yaxes(zeroline = False, title = dict(text = display_name, font = dict(size = 18 if len(display_name) < 10 else round(18* (10/len(display_name))))), title_standoff = 0, row = row_n, col = col_n)

# Function to generate individual plots
def plot_flank_bymut(motif, category, QCfilter, input_dict = flanking_norm_bymut):
    mutnorm_all_binconf_fig = make_subplots()
    plot_flank_add_bymut(motif = motif, category = category, display_name = motif + ' ' + category, plot_name = mutnorm_all_binconf_fig, row_n = 1, col_n = 1, input_dict = input_dict, QCfilter = QCfilter, showleg = True)
    mutnorm_all_binconf_fig.update_xaxes(range = [-50,50])
    mutnorm_all_binconf_fig.update_yaxes(range = [0.1,15.1], zeroline = False)
    return mutnorm_all_binconf_fig

In [None]:
mutnorm_perfect_bymut_fig_combined_figS2g = make_subplots(rows = 5, cols = 4, vertical_spacing = 0.025, horizontal_spacing = 0.025, shared_yaxes = True, shared_xaxes = True)

counter = 0
motif = 'STR'
for category in ['A', 'C', 'AT', 'AC', 'AG']:
    counter +=1
    plot_flank_add_bymut(motif = motif, category = category, QCfilter = -np.inf, input_dict = flanking_norm_bymut, display_name = category, plot_name = mutnorm_perfect_bymut_fig_combined_figS2g, row_n = counter, col_n = 1, showleg = True if counter == 1 else False)

counter = 0
for category in ['A', 'C', 'AT', 'AC', 'AG']:
    counter +=1
    plot_flank_add_bymut(motif = motif, category = category, QCfilter = -2.774, input_dict = flanking_norm_bymut, display_name = category, plot_name = mutnorm_perfect_bymut_fig_combined_figS2g, row_n = counter, col_n = 2)

counter = 0
for category in ['A', 'C', 'AT', 'AC', 'AG']:
    counter +=1
    plot_flank_add_bymut(motif = motif, category = category, QCfilter = 0, input_dict = flanking_norm_bymut, display_name = category, plot_name = mutnorm_perfect_bymut_fig_combined_figS2g, row_n = counter, col_n = 3)
    
counter = 0
for category in ['A', 'C', 'AT', 'AC', 'AG']:
    counter +=1
    plot_flank_add_bymut(motif = motif, category = category, QCfilter = 4, input_dict = flanking_norm_bymut, display_name = category, plot_name = mutnorm_perfect_bymut_fig_combined_figS2g, row_n = counter, col_n = 4)

mutnorm_perfect_bymut_fig_combined_figS2g.update_yaxes(range = [0.25,10.2], row = 1)
mutnorm_perfect_bymut_fig_combined_figS2g.update_yaxes(range = [0.25,5.1], row = 2)
mutnorm_perfect_bymut_fig_combined_figS2g.update_yaxes(range = [0.25,8.2], row = 3)
mutnorm_perfect_bymut_fig_combined_figS2g.update_yaxes(range = [0.25,5.1], row = 4)
mutnorm_perfect_bymut_fig_combined_figS2g.update_yaxes(range = [0.25,5.1], row = 5)

mutnorm_perfect_bymut_fig_combined_figS2g.update_xaxes(title = dict(text = 'no QC', standoff = 0, font = dict(size = 16)), row = 5, col = 1)
mutnorm_perfect_bymut_fig_combined_figS2g.update_xaxes(title = dict(text = 'pass', standoff = 0, font = dict(size = 16)), row = 5, col = 2)
mutnorm_perfect_bymut_fig_combined_figS2g.update_xaxes(title = dict(text = 'VQSLOD>0', standoff = 0, font = dict(size = 16)), row = 5, col = 3)
mutnorm_perfect_bymut_fig_combined_figS2g.update_xaxes(title = dict(text = 'VQSLOD>4', standoff = 0, font = dict(size = 16)), row = 5, col = 4)
                
for cols in range(1,5):
    for rows in range(1,6):
        mutnorm_perfect_bymut_fig_combined_figS2g.add_shape(type='line', x0=-100, y0=1, x1=250, y1=1, line=dict(color='Black', width = .3), row = rows, col = cols)
mutnorm_perfect_bymut_fig_combined_figS2g.update_xaxes(range = [-25,25], tickmode = 'array', tickvals = [-20, -10, 0, 10, 20])
mutnorm_perfect_bymut_fig_combined_figS2g.update_layout(width = 800, height = 800, margin = dict(l = 35, r = 25, b = 0, t = 20), legend=dict(y = -0.085, x = 0.2, orientation='h'))
mutnorm_perfect_bymut_fig_combined_figS2g.show()

In [None]:
mutnorm_perfect_bymut_fig_combined_figS2g.write_image('./plots/revision_ACcor_flanking_mutation_fig_S2G.png', format='png', scale = 10, engine = 'orca')

# Indels flanking motifs <a name="indel_analysis_flank"></a>

[Return to Table of Contents](#TOC)

#### Define counting functions for indels
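
The counting function defined in this section expands every motif into a table of genomic positions, each tagged with its position relative to the motif: negative values for the left flank, positive values for the right flank, and 0 for positions inside the motif. The sketch below illustrates that expansion for a single motif using plain lists. It assumes 0-based, half-open motif coordinates `[start, end)`, and `flank_positions` is a hypothetical name used only for this example; the filtering against nearby masked elements is omitted.

```python
def flank_positions(start, end, distance):
    """Return (genomic_position, relative_position) pairs for one motif.

    Assumes 0-based, half-open motif coordinates [start, end).
    """
    # Left flank: relative positions -distance .. -1, counted from the motif start
    left = [(start + rel, rel) for rel in range(-distance, 0)]
    # Right flank: relative positions 1 .. distance, counted from the motif end
    right = [(end + rel - 1, rel) for rel in range(1, distance + 1)]
    # Positions inside the motif all share relative position 0
    middle = [(pos, 0) for pos in range(start, end)]
    return left + right + middle

pairs = flank_positions(start=100, end=103, distance=2)
```

With these toy inputs the left flank covers positions 98 and 99, the right flank covers 103 and 104, and the three motif positions 100 to 102 are all assigned relative position 0.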

In [None]:
def count_indel_flank_chrom(chrom, current_pos_chrom, input_mut_dict, distance, left_pos_col, right_pos_col, filter_cols, filter_distance, useful_cols, vqslod_list_indel, indel_types):
    print('\r' + str(chrom), end='        ')

    # filter out motifs too close to masked elements
    current_pos_chrom = current_pos_chrom.loc[current_pos_chrom[[col + '_min' for col in filter_cols]].fillna(0).min(axis=1) > filter_distance].copy()
  
    # make a single dataframe consisting of all search coordinates and their position relative to the original starts/ends of interest
    
    search_positions_left = dict()
    for pos in range(-distance,0):
        search_positions_left[pos] = pd.DataFrame(current_pos_chrom[left_pos_col] + pos)
    search_positions_left = pd.concat(search_positions_left).reset_index()[[left_pos_col, 'level_0']]
    for col in useful_cols:
        search_positions_left[col] = list(current_pos_chrom[col])*distance
    if len(filter_cols) > 0:
        search_positions_left['filter_left'] = list(current_pos_chrom[[col + '_left' for col in filter_cols]].min(axis=1))*distance
        search_positions_left = search_positions_left.loc[search_positions_left['level_0'].abs() <= (search_positions_left['filter_left'] - filter_distance)].copy()
        del search_positions_left['filter_left']
    search_positions_left.columns = ['POS_mid', 'relative_pos'] + useful_cols

    search_positions_right = dict()
    for pos in range(1,distance+1):
        search_positions_right[pos] = pd.DataFrame(current_pos_chrom[right_pos_col] + pos -1)
    search_positions_right = pd.concat(search_positions_right).reset_index()[[right_pos_col, 'level_0']]
    for col in useful_cols:
        search_positions_right[col] = list(current_pos_chrom[col])*distance
    if len(filter_cols) > 0:
        search_positions_right['filter_right'] = list(current_pos_chrom[[col + '_right' for col in filter_cols]].min(axis=1))*distance
        search_positions_right = search_positions_right.loc[search_positions_right['level_0'] <= (search_positions_right['filter_right'] - filter_distance)].copy()
        del search_positions_right['filter_right']
    search_positions_right.columns = ['POS_mid', 'relative_pos'] + useful_cols

    for col in filter_cols:
        current_pos_chrom = current_pos_chrom.loc[current_pos_chrom[col + '_min'] >= filter_distance].copy()
    current_pos_chrom['pos'] = [list(range(left,right)) for left, right in zip(current_pos_chrom[left_pos_col], current_pos_chrom[right_pos_col])]
    for col in useful_cols:
        current_pos_chrom[col] = [[entry]*len(pos) for entry, pos in zip(current_pos_chrom[col], current_pos_chrom['pos'])]
    search_positions_middle = pd.DataFrame(flatten(current_pos_chrom['pos']), columns = ['POS_mid'])
    search_positions_middle['relative_pos'] = 0
    for col in useful_cols:
        search_positions_middle[col] = pd.DataFrame(flatten(current_pos_chrom[col]))

    search_positions = pd.concat([search_positions_left, search_positions_right, search_positions_middle])
        
    # look up the mutation counts at every search coordinate (positions without a variant become NaN)
    current_mut_chrom = input_mut_dict[chrom].copy()
    current_mut_chrom.index = current_mut_chrom.index - 1     # convert coordinates from 1-based to 0-based

    print('\r' + str(chrom), end='        ')
    current_mut_count = current_mut_chrom.reindex(search_positions['POS_mid'])
    current_mut_count['pos'] = list(search_positions['relative_pos'])
    current_mut_count['Tri'] = [tri_function(chrom, pos, base = 0) for pos in current_mut_count.index]

    for col in useful_cols:
        current_mut_count[col] = list(search_positions[col])
    current_mut_count['tri_count'] = 1    
    current_mut_count = current_mut_count.loc[current_mut_count['Tri'].isin(all_triplets)]
    
    # count mutations at each relative position using groupby
    current_mut_sum_group = current_mut_count.groupby(['pos'] + useful_cols + ['Tri']).sum()
    current_mut_sum = dict()
    for qc_cutoff in vqslod_list_indel:
        current_mut_sum[qc_cutoff] = current_mut_sum_group[[qc_cutoff, 'tri_count']].astype(int)
        current_mut_sum[qc_cutoff].columns = indel_types + ['tri_count']
        
    return current_mut_sum

def count_indel_flank(input_pos_df, input_mut_dict = variants_indel_slim_AC, distance = 500, left_pos_col = 'start', right_pos_col = 'end', chrom_col = 'chrom', strand_col = 'Strand', strand_names = ('+', '-'), filter_cols = ['RM_distance'], filter_distance = 20, useful_cols = [], indel_types = ['del', 'ins']):
   
    current_mut_sum_chrom = dict()
    for chrom in range(chr_range,23):
        current_mut_sum_chrom[chrom] = count_indel_flank_chrom(chrom, input_pos_df.loc[input_pos_df[chrom_col] == chrom], input_mut_dict, distance = distance, left_pos_col = left_pos_col, right_pos_col = right_pos_col, filter_cols = filter_cols, filter_distance = filter_distance, useful_cols = useful_cols, vqslod_list_indel = vqslod_list_indel, indel_types = indel_types)
    
    current_mut_sum = dict()
    for qc_cutoff in vqslod_list_indel:
        current_mut_sum[qc_cutoff] = pd.concat([current_mut_sum_chrom[chrom][qc_cutoff] for chrom in range(chr_range,23)])
        
    # apply reverse complement to - strand triplets, mutation counts and positions          
    
    if strand_col in useful_cols:
        useful_cols = [col for col in useful_cols if col != strand_col]    # drop the strand column without mutating the caller's list
        current_mut_sum_strand_F = dict()
        current_mut_sum_strand_R = dict()
        current_mut_sum_strand_other = dict()
        current_mut_sum_bothstrands = dict()
        for qc_cutoff in current_mut_sum:
            current_mut_sum_strand_other[qc_cutoff] = current_mut_sum[qc_cutoff].loc[~current_mut_sum[qc_cutoff].index.get_level_values(strand_col).isin(strand_names)]
            current_mut_sum_strand_other[qc_cutoff] = current_mut_sum_strand_other[qc_cutoff].reset_index().groupby(['pos'] + useful_cols + ['Tri']).sum()
            
            current_mut_sum_strand_F[qc_cutoff] = current_mut_sum[qc_cutoff].loc[current_mut_sum[qc_cutoff].index.get_level_values(strand_col) == strand_names[0]]
            current_mut_sum_strand_F[qc_cutoff] = current_mut_sum_strand_F[qc_cutoff].reset_index().groupby(['pos'] + useful_cols + ['Tri']).sum()

            current_mut_sum_strand_R[qc_cutoff] = current_mut_sum[qc_cutoff].loc[current_mut_sum[qc_cutoff].index.get_level_values(strand_col) == strand_names[1]]
            current_mut_sum_strand_R[qc_cutoff] = current_mut_sum_strand_R[qc_cutoff].reset_index()
            current_mut_sum_strand_R[qc_cutoff]['Tri'] = current_mut_sum_strand_R[qc_cutoff]['Tri'].apply(reverse_complement)
            current_mut_sum_strand_R[qc_cutoff]['pos'] = -current_mut_sum_strand_R[qc_cutoff]['pos']
            current_mut_sum_strand_R[qc_cutoff] = current_mut_sum_strand_R[qc_cutoff].groupby(['pos'] + useful_cols + ['Tri']).sum()
            current_mut_sum_strand_R[qc_cutoff].columns = indel_types + ['tri_count']
            current_mut_sum_strand_R[qc_cutoff] = current_mut_sum_strand_R[qc_cutoff][indel_types + ['tri_count']]

            current_mut_sum_bothstrands[qc_cutoff] = current_mut_sum_strand_F[qc_cutoff].add(current_mut_sum_strand_R[qc_cutoff], fill_value = 0).add(current_mut_sum_strand_other[qc_cutoff], fill_value = 0).astype(int)
    else:
        current_mut_sum_bothstrands = dict()
        for qc_cutoff in current_mut_sum:
            current_mut_sum_bothstrands[qc_cutoff] = current_mut_sum[qc_cutoff].reset_index().groupby(['pos'] + useful_cols + ['Tri']).sum().fillna(0).astype(int)

    # reformat output to NNN_N rows x pos columns, and split mut counts and trinucleotide counts
    
    current_mut_sum_reformat = dict()
    for qc_cutoff in current_mut_sum:
        current_mut_sum_reformat[qc_cutoff] = dict()
        for mut in indel_types:
            current_mut_sum_reformat[qc_cutoff][mut] = current_mut_sum_bothstrands[qc_cutoff].unstack().transpose().fillna(0).astype(int).loc[mut]
            current_mut_sum_reformat[qc_cutoff][mut].index = [tri+'_'+mut for tri in current_mut_sum_reformat[qc_cutoff][mut].index]
        current_mut_sum_reformat[qc_cutoff] = pd.concat(current_mut_sum_reformat[qc_cutoff]).reset_index().groupby(['level_1']).sum().reindex(triplet_mutations_und_indel)
        current_mut_sum_reformat[qc_cutoff].index.name = 'Mut'

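    # trinucleotide counts do not depend on the QC cutoff, so take them from the last cutoff processed above
    current_tri_sum = current_mut_sum_bothstrands[qc_cutoff].unstack().transpose().fillna(0).astype(int).loc['tri_count'].reindex(all_triplets)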

    return current_mut_sum_reformat.copy(), current_tri_sum.copy()

### Count indels <a name="indel_analysis_flank_count"></a>

[Return to Table of Contents](#TOC)

#### Random sequences
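
The random sequences serve as a baseline: for each QC cutoff, the median normalized frequency within 50 positions of the random sites is stored and later used to rescale motif-flanking frequencies, so that a value of 1 means "same as random". The sketch below shows that idea with plain dictionaries and toy numbers. The function names `median_baseline` and `normalize_to_random` are hypothetical, and the frequencies are invented for illustration; they are not results from the analysis.

```python
from statistics import median

def median_baseline(freq_by_pos, window=50):
    """Median frequency over relative positions -window..window (inclusive)."""
    values = [freq for pos, freq in freq_by_pos.items() if -window <= pos <= window]
    return median(values)

def normalize_to_random(freq_by_pos, baseline):
    """Express each frequency as a ratio to the random baseline."""
    return {pos: freq / baseline for pos, freq in freq_by_pos.items()}

# Toy data: a flat random profile and a motif profile elevated next to the motif
random_freq = {pos: 0.125 for pos in range(-100, 101)}
motif_freq = {-2: 0.125, -1: 0.375, 1: 0.25, 2: 0.125}

baseline = median_baseline(random_freq)
relative = normalize_to_random(motif_freq, baseline)
```

The inclusive window mirrors the label-based `.loc[-50:50]` slice used in the cell below, which includes both endpoints.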

In [None]:
# Count indels (all)
count_random_indel = count_indel_flank(random_seq, left_pos_col = 'start', right_pos_col = 'end', filter_cols = ['RM_distance', 'STR_distance', 'nonSTR_distance', 'within_motif_distance'], useful_cols = ['Strand'])

norm_random_indel = dict()
norm_random_indel['ins'] = mut_norm_conf(count_random_indel, snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins)
norm_random_indel['del'] = mut_norm_conf(count_random_indel, snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del)

normtorandom_indel = pd.DataFrame()
normtorandom_indel['ins'] = pd.Series([norm_random_indel['ins'][0][qc_cutoff].loc[-50:50].median() for qc_cutoff in norm_random_indel['ins'][0]], index = vqslod_list_indel)
normtorandom_indel['del'] = pd.Series([norm_random_indel['del'][0][qc_cutoff].loc[-50:50].median() for qc_cutoff in norm_random_indel['del'][0]], index = vqslod_list_indel)

In [None]:
# Count long/short insertions
count_random_indel_long = count_indel_flank(random_seq, input_mut_dict = variants_ins_slim_AC_long, left_pos_col = 'start', right_pos_col = 'end', filter_cols = ['RM_distance', 'STR_distance', 'nonSTR_distance', 'within_motif_distance'], useful_cols = ['Strand'], indel_types = ['ins'])
norm_random_indel['ins_long'] = mut_norm_conf(count_random_indel_long, snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_ins_count_AC_freq_long, tri_subset = triplet_mutations_und_ins)
normtorandom_indel['ins_long'] = pd.Series([norm_random_indel['ins_long'][0][qc_cutoff].loc[-50:50].median() for qc_cutoff in norm_random_indel['ins_long'][0]], index = vqslod_list_indel)

count_random_indel_short = count_indel_flank(random_seq, input_mut_dict = variants_ins_slim_AC_short, left_pos_col = 'start', right_pos_col = 'end', filter_cols = ['RM_distance', 'STR_distance', 'nonSTR_distance', 'within_motif_distance'], useful_cols = ['Strand'], indel_types = ['ins'])
norm_random_indel['ins_short'] = mut_norm_conf(count_random_indel_short, snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_ins_count_AC_freq_short, tri_subset = triplet_mutations_und_ins)
normtorandom_indel['ins_short'] = pd.Series([norm_random_indel['ins_short'][0][qc_cutoff].loc[-50:50].median() for qc_cutoff in norm_random_indel['ins_short'][0]], index = vqslod_list_indel)

In [None]:
# Assign color scheme for indel plots
QC_colors_indel = make_colorscale(vqslod_list_indel)
QC_colors_indel = pd.DataFrame(QC_colors_indel).transpose()
QC_colors_indel['name'] = ['no QC', 'pass', 'VQSLOD >0', 'VQSLOD >1.4']

#### Count STRs

In [None]:
count_STRs_indel = dict()
norm_STRs_indel = dict()
for repeat in repeats_highpower_asym[:11].index:
    print('\r' + '                     ' + repeat, end='  ')
    count_STRs_indel[repeat] = count_indel_flank(long_STRs[repeat], filter_cols = ['RM_distance', 'STR_distance', 'nonSTR_distance', 'within_motif_distance'], useful_cols = ['Strand'])
for repeat in repeats_highpower_sym:
    print('\r' + '                     ' + repeat, end='  ')
    count_STRs_indel[repeat] = count_indel_flank(long_STRs[repeat], filter_cols = ['RM_distance', 'STR_distance', 'nonSTR_distance', 'within_motif_distance'], useful_cols = [])
count_STRs_indel['GC_noCGI'] = count_indel_flank(STRs_CG, filter_cols = ['RM_distance', 'STR_distance', 'nonSTR_distance', 'within_motif_distance', 'CGI_distance'])
for repeat in count_STRs_indel:
    norm_STRs_indel[repeat] = dict()
    norm_STRs_indel[repeat]['ins'] = mut_norm_conf(count_STRs_indel[repeat], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
    norm_STRs_indel[repeat]['del'] = mut_norm_conf(count_STRs_indel[repeat], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

#### Count IRs

In [None]:
count_IRs_indel = dict()
norm_IRs_indel = dict()
for motif, name in zip([long_IRs, very_long_IRs, AT80_IRs, AT40_IRs, long_IRs_loop10], ['long_IRs', 'very_long_IRs', 'AT80_IRs', 'AT40_IRs', 'long_IRs_loop10']):
    print('\r' + '                     ' + name, end='  ')
    count_IRs_indel[name] = count_indel_flank(motif, filter_cols = ['RM_distance', 'STR_distance', 'nonSTR_distance', 'within_motif_distance'], useful_cols = [])
    norm_IRs_indel[name] = dict()
    norm_IRs_indel[name]['ins'] = mut_norm_conf(count_IRs_indel[name], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
    norm_IRs_indel[name]['del'] = mut_norm_conf(count_IRs_indel[name], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

#### Count MRs

In [None]:
count_MRs_indel = dict()
norm_MRs_indel = dict()
for motif, name in zip([long_MRs, very_long_MRs, non_homopurine_MRs, long_MRs_loop10], ['long_MRs', 'very_long_MRs', 'non_homopurine_MRs', 'long_MRs_loop10']):
    print('\r' + '                     ' + name, end='  ')
    count_MRs_indel[name] = count_indel_flank(motif, filter_cols = ['RM_distance', 'STR_distance', 'nonSTR_distance', 'within_motif_distance'], useful_cols = [])
for motif, name in zip([homopurine_MRs], ['homopurine_MRs']):
    print('\r' + '                     ' + name, end='  ')
    count_MRs_indel[name] = count_indel_flank(motif, filter_cols = ['RM_distance', 'STR_distance', 'nonSTR_distance', 'within_motif_distance'], useful_cols = ['Strand'])
for name in count_MRs_indel:
    norm_MRs_indel[name] = dict()
    norm_MRs_indel[name]['ins'] = mut_norm_conf(count_MRs_indel[name], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
    norm_MRs_indel[name]['del'] = mut_norm_conf(count_MRs_indel[name], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

#### Count DRs

In [None]:
count_DRs_indel = dict()
norm_DRs_indel = dict()
for motif, name in zip([long_DRs, very_long_DRs, perfect_long_DRs, long_DRs_loop10], ['long_DRs', 'very_long_DRs', 'perfect_long_DRs', 'long_DRs_loop10']):
    print('\r' + '                     ' + name, end='  ')
    count_DRs_indel[name] = count_indel_flank(motif, filter_cols = ['RM_distance', 'STR_distance', 'nonSTR_distance', 'within_motif_distance'], useful_cols = [])
    norm_DRs_indel[name] = dict()
    norm_DRs_indel[name]['ins'] = mut_norm_conf(count_DRs_indel[name], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
    norm_DRs_indel[name]['del'] = mut_norm_conf(count_DRs_indel[name], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

#### Count ZDNAs

In [None]:
count_ZDNAs_indel = dict()
norm_ZDNAs_indel = dict()
for motif, name in zip([all_ZDNAs, long_ZDNAs], ['all_ZDNAs', 'long_ZDNAs']):
    print('\r' + '                     ' + name, end='  ')
    count_ZDNAs_indel[name] = count_indel_flank(motif, filter_cols = ['RM_distance', 'STR_distance', 'nonSTR_distance', 'within_motif_distance'], useful_cols = [])
for motif, name in zip([ZDNAs_GY], ['ZDNAs_GY']):
    print('\r' + '                     ' + name, end='  ')
    count_ZDNAs_indel[name] = count_indel_flank(motif, filter_cols = ['RM_distance', 'STR_distance', 'nonSTR_distance', 'within_motif_distance'], useful_cols = ['Strand'])
for name in count_ZDNAs_indel:
    norm_ZDNAs_indel[name] = dict()
    norm_ZDNAs_indel[name]['ins'] = mut_norm_conf(count_ZDNAs_indel[name], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
    norm_ZDNAs_indel[name]['del'] = mut_norm_conf(count_ZDNAs_indel[name], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

#### Count G4s

In [None]:
count_G4s_indel = dict()
norm_G4s_indel = dict()
for motif, name in zip([K_G4s, PDS_G4s], ['K_G4s', 'PDS_G4s']):
    print('\r' + '                     ' + name, end='  ')
    count_G4s_indel[name] = count_indel_flank(motif, left_pos_col = 'start', right_pos_col = 'end', filter_cols = ['RM_distance', 'STR_distance', 'nonSTR_distance', 'within_motif_distance'], useful_cols = ['Strand'])
    norm_G4s_indel[name] = dict()
    norm_G4s_indel[name]['ins'] = mut_norm_conf(count_G4s_indel[name], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
    norm_G4s_indel[name]['del'] = mut_norm_conf(count_G4s_indel[name], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

### Save/load <a name="indel_analysis_flank_load"></a>

[Return to Table of Contents](#TOC)

In [None]:
# Save temporary output of the mutation counts
flanking_norm_indel = dict()
flanking_norm_indel['STR'] = norm_STRs_indel.copy()
flanking_norm_indel['IR'] = norm_IRs_indel.copy()
flanking_norm_indel['MR'] = norm_MRs_indel.copy()
flanking_norm_indel['DR'] = norm_DRs_indel.copy()
flanking_norm_indel['ZDNA'] = norm_ZDNAs_indel.copy()
flanking_norm_indel['G4'] = norm_G4s_indel.copy()
flanking_norm_indel['random'] = norm_random_indel.copy()

with open('./analysis/temp/flanking_norm_indel_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(flanking_norm_indel, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Load temporary output of the mutation counts
with open('./analysis/temp/flanking_norm_indel_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    flanking_norm_indel = pickle.load(handle)

normtorandom_indel = pd.DataFrame()
normtorandom_indel['ins'] = pd.Series([flanking_norm_indel['random']['ins'][0][qc_cutoff].loc[-50:50].median() for qc_cutoff in flanking_norm_indel['random']['ins'][0]], index = vqslod_list_indel)
normtorandom_indel['del'] = pd.Series([flanking_norm_indel['random']['del'][0][qc_cutoff].loc[-50:50].median() for qc_cutoff in flanking_norm_indel['random']['del'][0]], index = vqslod_list_indel)

normtorandom_indel['ins_long'] = pd.Series([flanking_norm_indel['random']['ins_long'][0][qc_cutoff].loc[-50:50].median() for qc_cutoff in flanking_norm_indel['random']['ins_long'][0]], index = vqslod_list_indel)
normtorandom_indel['ins_short'] = pd.Series([flanking_norm_indel['random']['ins_short'][0][qc_cutoff].loc[-50:50].median() for qc_cutoff in flanking_norm_indel['random']['ins_short'][0]], index = vqslod_list_indel)
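
The save/load cells above write a nested dictionary to a pickle file and read it back. A minimal round-trip sketch follows, using a temporary directory instead of `./analysis/temp/`. `toy_results` and `chr_pickle_path` are hypothetical stand-ins, and the values are made up.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for flanking_norm_indel: motif -> category -> mutation type -> values
toy_results = {
    "STR": {"A": {"ins": [1.0, 2.5], "del": [0.8, 3.1]}},
    "random": {"ins": [1.0, 1.0], "del": [1.0, 1.0]},
}

def chr_pickle_path(directory, chr_range):
    """Build the file name the same way as the cells above (prefix + chr_range + '-22.pickle')."""
    return os.path.join(directory, "flanking_norm_indel_chr" + str(chr_range) + "-22.pickle")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = chr_pickle_path(tmp_dir, 1)
    with open(path, "wb") as handle:
        pickle.dump(toy_results, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

# The restored object is equal in content but is a new object in memory.
print(restored == toy_results, restored is toy_results)
```

Because `pickle.load` can execute arbitrary code, it should only be used on files produced by a trusted run of this notebook.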

### Plot <a name="indel_analysis_flank_plot"></a>

[Return to Table of Contents](#TOC)

In [None]:
# Color scale for indel QC cutoffs
vqslod_list_indel = [-np.inf, -1.0607, 0, 1.4]
QC_colors_indel = make_colorscale(vqslod_list_indel)
QC_colors_indel = pd.DataFrame(QC_colors_indel).transpose()
QC_colors_indel['name'] = ['no QC', 'pass', 'VQSLOD >0', 'VQSLOD >1.4']

# Function to generate individual plot traces
def plot_indel_flank_add(motif, category, mut_type, display_name, plot_name, row_n, col_n, input_dict = flanking_norm_indel, showleg = False):
    for QCfilter in QC_colors_indel[0].index:
        plot_name.add_trace(go.Scatter(name = QC_colors_indel['name'][QCfilter], legendgroup = QCfilter, showlegend=False, fill=None, x = input_dict[motif][category][mut_type][2][QCfilter].dropna().index, y = input_dict[motif][category][mut_type][2][QCfilter].dropna(), mode = 'lines', line = dict(width = 0, color = QC_colors_indel[0][QCfilter])), row=row_n, col=col_n)
        plot_name.add_trace(go.Scatter(name = QC_colors_indel['name'][QCfilter], legendgroup = QCfilter, showlegend = showleg, fill='tonexty', fillcolor = QC_colors_indel[1][QCfilter], x = input_dict[motif][category][mut_type][0][QCfilter].index, y = input_dict[motif][category][mut_type][0][QCfilter], mode = 'lines', line = dict(width = 2, color = QC_colors_indel[0][QCfilter])), row=row_n, col=col_n)
        plot_name.add_trace(go.Scatter(name = QC_colors_indel['name'][QCfilter], legendgroup = QCfilter, showlegend=False, fill='tonexty', fillcolor = QC_colors_indel[1][QCfilter], x = input_dict[motif][category][mut_type][1][QCfilter].dropna().index, y = input_dict[motif][category][mut_type][1][QCfilter].dropna(), mode = 'lines', line = dict(width = 0, color = QC_colors_indel[0][QCfilter])), row=row_n, col=col_n)
    plot_name.add_shape(type='line', x0=-500, y0=1, x1=500, y1=1, line=dict(color='Black', width = .3), row=row_n, col=col_n)
    plot_name.update_yaxes(zeroline = False, title = dict(text = display_name, font = dict(size = 18 if len(display_name) < 10 else round(18* (10/len(display_name))))), title_standoff = 0, row = row_n, col = col_n)

# Function to generate individual plots
def plot_indel_flank(motif, category, input_dict = flanking_norm_indel, log = False, yrange = [.01, 10]):
    mutnorm_all_binconf_fig = make_subplots(rows = 2, shared_xaxes = True)
    plot_indel_flank_add(motif = motif, category = category, mut_type = 'ins', display_name = motif + ' ' + category + ' ins', plot_name = mutnorm_all_binconf_fig, row_n = 1, col_n = 1, input_dict = input_dict, showleg = True)
    plot_indel_flank_add(motif = motif, category = category, mut_type = 'del', display_name = motif + ' ' + category + ' del', plot_name = mutnorm_all_binconf_fig, row_n = 2, col_n = 1, input_dict = input_dict, showleg = False)
    mutnorm_all_binconf_fig.update_xaxes(range = [-250,250])
    mutnorm_all_binconf_fig.update_yaxes(range = yrange, zeroline = False, type = 'log' if log else 'linear')
    return mutnorm_all_binconf_fig
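
`plot_indel_flank_add` shrinks the y-axis title font for long labels: names shorter than 10 characters use size 18, and longer names are scaled by `10/len(display_name)`. The sketch below isolates that expression so the resulting sizes can be checked. `title_font_size` is a hypothetical helper, not a function defined in this notebook.

```python
def title_font_size(display_name, base_size=18, max_chars=10):
    """Mirror the font-size expression used for the y-axis title."""
    if len(display_name) < max_chars:
        return base_size
    return round(base_size * (max_chars / len(display_name)))

# Short labels keep the base size; long labels shrink in proportion to their length.
for label in ["A", "Inverted", "IR long_IRs_loop10 ins"]:
    print(label, len(label), title_font_size(label))
```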

#### Supplementary figure S2H <a name="indel_analysis_flank_plot_S2H"></a>

[Return to Table of Contents](#TOC)

In [None]:
# Generate Supplementary Figure S2H

indel_fig_combined_figS2h = make_subplots(rows = 5, cols = 4, shared_xaxes = True, vertical_spacing=0.02, horizontal_spacing = 0.11, subplot_titles = ['insertion', 'deletion', 'insertion', 'deletion'])

counter = 0
motif = 'STR'
for category in ['A', 'C', 'AT', 'AC', 'AG']:
    counter +=1
    plot_indel_flank_add(motif = motif, category = category, mut_type = 'ins', display_name = category, plot_name = indel_fig_combined_figS2h, row_n = counter, col_n = 1, showleg = (counter == 1))
    plot_indel_flank_add(motif = motif, category = category, mut_type = 'del', display_name = '', plot_name = indel_fig_combined_figS2h, row_n = counter, col_n = 2, showleg = False)

    
counter = 0
for motif, category, name in zip(['IR', 'DR', 'MR', 'ZDNA', 'G4'], ['long_IRs_loop10', 'long_DRs_loop10', 'long_MRs_loop10', 'long_ZDNAs', 'K_G4s'], ['Inverted', 'Direct', 'Mirror', 'Z-DNA', 'G4']):
    counter +=1
    plot_indel_flank_add(motif = motif, category = category, mut_type = 'ins', display_name = name, plot_name = indel_fig_combined_figS2h, row_n = counter, col_n = 3)
    plot_indel_flank_add(motif = motif, category = category, mut_type = 'del', display_name = '', plot_name = indel_fig_combined_figS2h, row_n = counter, col_n = 4)
    
indel_fig_combined_figS2h.update_xaxes(domain=[0, 0.21], col = 1)
indel_fig_combined_figS2h.update_xaxes(domain=[0.23, 0.44], col = 2)
indel_fig_combined_figS2h.update_xaxes(domain=[0.56, 0.77], col = 3)
indel_fig_combined_figS2h.update_xaxes(domain=[0.79, 1], col = 4)

indel_fig_combined_figS2h.update_yaxes(range = [-0.299,2.75], type = 'log', dtick = 1)
indel_fig_combined_figS2h.update_yaxes(showticklabels = False, col = 2)
indel_fig_combined_figS2h.update_yaxes(showticklabels = False, col = 4)

    
indel_fig_combined_figS2h.update_xaxes(range = [-50,50], tickmode = 'array', tickvals = list(range(-40,60,20)))
indel_fig_combined_figS2h.update_layout(height = 650, width = 800, margin = dict(l = 55, r = 25, b = 45, t = 30), legend=dict(y = -.075, x = -0.075, orientation='h', itemwidth = 30))
indel_fig_combined_figS2h.show()
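
Plotly reads the `range` of a log-type axis in log10 units, so `range = [-0.299, 2.75]` above is not a range of data values. The short sketch below converts between the two so the visible window is easier to reason about. `to_log_range` is a hypothetical helper.

```python
import math

log_range = [-0.299, 2.75]  # value passed to update_yaxes(range = ..., type = 'log')

# Convert the log10 bounds back to data units: roughly 0.5 to 562.
data_range = [10 ** bound for bound in log_range]
print([round(value, 3) for value in data_range])

def to_log_range(low, high):
    """Convert bounds given in data units into the log10 units a log axis expects."""
    return [math.log10(low), math.log10(high)]

print(to_log_range(0.5, 500))
```

With `dtick = 1` on a log axis, ticks fall on successive powers of ten, so this window shows ticks at 1, 10, and 100.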

In [None]:
indel_fig_combined_figS2h.write_image('./plots/revision_flanking_indel_fig_S2H.png', format='png', scale = 10, engine = 'orca')

# Calculate mutation frequency at internal motif positions <a name="mutation_internal"></a>

## Annotate positions within motifs <a name="mutation_internal_annotate"></a>

[Return to Table of Contents](#TOC)

In [None]:
# (Re)Load motif database
all_motifs_unique = pd.read_pickle('./custom_db/all_motifs_unique_chr'+str(chr_range)+'-22.pickle')

random_seq = pd.read_csv('./custom_db/random_sequences_set1_chr1-22.csv.gz', compression = 'gzip')
random_seq_set2 = pd.read_csv('./custom_db/random_sequences_set2_chr1-22.csv.gz', compression = 'gzip')
random_seq = pd.concat([random_seq, random_seq_set2])
# assign random strand
random_seq['Strand'] = np.random.randint(0,2, size=len(random_seq))
random_seq['Strand'] = random_seq['Strand'].replace(0, '-').replace(1,'+')
random_seq = distance_within_df(random_seq, 'within_motif')
random_seq = random_seq.drop_duplicates(subset = ['chrom', 'start']).copy()

In [None]:
# Keep sites with STR distance > 20 and with non-STR, within-motif, and RepeatMasker distances all > 5 (missing distances count as 0)
all_motifs_unique = all_motifs_unique.loc[(all_motifs_unique['STR_distance_min'] > 20) & (all_motifs_unique[['nonSTR_distance_min', 'within_motif_distance_min', 'RM_distance_min']].fillna(0).min(axis=1) > 5)]
random_seq = random_seq.loc[(random_seq['STR_distance_min'] > 20) & (random_seq[['nonSTR_distance_min', 'within_motif_distance_min', 'RM_distance_min']].fillna(0).min(axis=1) > 5)].copy()
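
The filter above keeps a row only when two conditions hold: the STR distance exceeds 20, and the smallest of the three other distances exceeds 5 after missing values are replaced with 0. The following plain-Python sketch applies the same rule to made-up rows. `passes_distance_filter` and `toy_motifs` are hypothetical, and the pandas expression in the cell remains the actual filter.

```python
import math

def passes_distance_filter(row, str_min=20, other_min=5):
    """Return True when a site is far enough from all neighbouring features."""
    other_cols = ["nonSTR_distance_min", "within_motif_distance_min", "RM_distance_min"]
    # Missing distances count as 0, like .fillna(0), so they fail the > other_min test.
    others = [0 if math.isnan(row[col]) else row[col] for col in other_cols]
    return row["STR_distance_min"] > str_min and min(others) > other_min

nan = float("nan")
toy_motifs = [
    {"name": "kept", "STR_distance_min": 35, "nonSTR_distance_min": 12, "within_motif_distance_min": 8, "RM_distance_min": 40},
    {"name": "near_STR", "STR_distance_min": 20, "nonSTR_distance_min": 12, "within_motif_distance_min": 8, "RM_distance_min": 40},
    {"name": "near_RM", "STR_distance_min": 50, "nonSTR_distance_min": 12, "within_motif_distance_min": 8, "RM_distance_min": 5},
    {"name": "missing", "STR_distance_min": 50, "nonSTR_distance_min": nan, "within_motif_distance_min": 8, "RM_distance_min": 40},
]

kept = [motif["name"] for motif in toy_motifs if passes_distance_filter(motif)]
print(kept)
```

Both comparisons are strict, so a distance exactly equal to the threshold is excluded, and a row with any missing value among the three non-STR distances is dropped.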

In [None]:
# Annotate positions within motifs
pos_expand_all = dict()

def pos_expand(input_df):
    
    useful_cols = [category for category in ['chrom', 'Type', 'repeat', 'status', 'Strand', 'GC%_stem', 'purine', '#MM', 'stem_len', 'spacer', 'length'] if category in input_df.columns]
    
    single_count_position_list = [category for category in ["5'_motif_pos", "3'_motif_pos", "5'_flank_pos", "3'_flank_pos"] if category in input_df.columns]
    count_position_list = [category for category in ['run_positions_middle', 'loop_positions_middle', 'run_positions_edge', 'loop_positions_edge', 'spacer_middle_pos', 'motif_pos_genome_middle'] if category in input_df.columns]
    single_pred_position_list = pd.Series(["spacer_5'_pred", "spacer_3'_pred", "5'_flank_pred", "3'_flank_pred"], index = ["spacer_5'_pos", "spacer_3'_pos", "5'_flank_pos", "3'_flank_pos"])
    single_pred_position_list = single_pred_position_list.loc[single_pred_position_list.isin(input_df.columns)]
    single_count_position_list = [category for category in single_count_position_list if category not in single_pred_position_list.index]
    pred_position_list = pd.Series(['MM_pred', 'MM_pred_L', 'MM_pred_R'], index = ['MM_pos_genome', 'MM_pos_L', 'MM_pos_R'])
    pred_position_list = pred_position_list.loc[pred_position_list.isin(input_df.columns)]
    
    pos_expand_current = dict()
    for position in single_count_position_list:
        pos_expand_current[position] = input_df[[position] + useful_cols].copy()
        pos_expand_current[position].columns = ['pos'] + useful_cols
        pos_expand_current[position]['category'] = position
    for position in count_position_list:
        pos_expand_current[position] = pd.DataFrame(flatten(input_df[position].dropna()), columns = ['pos'])
        pos_expand_current[position]['category'] = position
        for col in useful_cols:
            pos_expand_current[position][col] = flatten([[length]*len(pos) for length, pos in zip(input_df[col].loc[input_df[position].dropna().index], input_df[position].dropna())])
    pos_expand_current = pd.concat(pos_expand_current).reset_index(drop = True)

    if (len(single_pred_position_list) + len(pred_position_list)) > 0:    
        pos_expand_current_pred = dict()
        for position in single_pred_position_list.index:
            pos_expand_current_pred[position] = pd.DataFrame(input_df[position].dropna().astype(int))
            pos_expand_current_pred[position].columns = ['pos']
            pos_expand_current_pred[position]['category'] = position
            pos_expand_current_pred[position]['pred'] = input_df[single_pred_position_list[position]].loc[input_df[position].dropna().index]
            for col in useful_cols:
                pos_expand_current_pred[position][col] = input_df[col].loc[input_df[position].dropna().index]
        for position in pred_position_list.index:
            pos_expand_current_pred[position] = pd.DataFrame(flatten(input_df[position].dropna()), columns = ['pos'])
            pos_expand_current_pred[position]['category'] = position
            pos_expand_current_pred[position]['pred'] = flatten(input_df[pred_position_list[position]].loc[input_df[position].dropna().index])
            for col in useful_cols:
                pos_expand_current_pred[position][col] = flatten([[length]*len(pos) for length, pos in zip(input_df[col].loc[input_df[position].dropna().index], input_df[position].dropna())])
        pos_expand_current_pred = pd.concat(pos_expand_current_pred).reset_index(drop = True)
        pos_expand_current_all = pd.concat([pos_expand_current, pos_expand_current_pred]).reset_index(drop = True)  
        return pos_expand_current_all
    else:
        return pos_expand_current
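
`pos_expand` reshapes one row per motif, where some columns hold lists of positions, into one row per position tagged with its category and the motif's descriptive columns. The sketch below shows that reshaping for a single list-valued column using plain dictionaries. `expand_positions` and `toy_motifs` are hypothetical, and `None` stands in for a missing value.

```python
def expand_positions(motifs, list_col, keep_cols):
    """Turn one row per motif (holding a list of positions) into one row per position."""
    rows = []
    for motif in motifs:
        positions = motif.get(list_col)
        if positions is None:  # plays the role of .dropna() in pos_expand
            continue
        for pos in positions:
            row = {"pos": pos, "category": list_col}
            # Repeat the motif-level annotations on every expanded row.
            row.update({col: motif[col] for col in keep_cols})
            rows.append(row)
    return rows

toy_motifs = [
    {"chrom": "chr1", "Type": "IR", "spacer_middle_pos": [105, 106]},
    {"chrom": "chr2", "Type": "IR", "spacer_middle_pos": []},
    {"chrom": "chr3", "Type": "IR", "spacer_middle_pos": None},
]

long_rows = expand_positions(toy_motifs, "spacer_middle_pos", ["chrom", "Type"])
for row in long_rows:
    print(row)
```

A motif with an empty list contributes no rows, which is why empty lists and missing values give the same result after expansion.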

### Locate interruptions within motifs and predict reversion mutations <a name="mutation_internal_annotate_interruptions"></a>
[Return to Table of Contents](#TOC)

#### STR motifs

In [None]:
all_STRs_unique = all_motifs_unique.loc[(all_motifs_unique['Type'] == 'STR') & (all_motifs_unique['repeat'].isin(repeats_highpower_allframes))].dropna(axis = 1, how = 'all').copy()
# Fix naming of repeat frames which were not necessarily paired reverse complements
def match_highpower_repeat(repeat):
    # Return the repeat itself, or the first of its variations (indices 1 to 3) listed in repeats_highpower
    if repeat in repeats_highpower:
        return repeat
    variations = repeat_variations(repeat)[0]
    for idx in (1, 2, 3):
        if variations[idx] in repeats_highpower:
            return variations[idx]
    return np.nan

all_STRs_unique['repeat'] = [match_highpower_repeat(repeat) for repeat in all_STRs_unique['repeat']]

# STR in-frame interruptions
all_STRs_unique['MM_pos'] = [np.where([a!=b for a,b in zip(seq1, (frame*length)[:length])])[0] for seq1, frame, length in zip(all_STRs_unique['Sequence'], all_STRs_unique['repeat_frame_L'], all_STRs_unique['length'])]
all_STRs_unique['MM_pred'] = [[(frame*length)[:length][pos] for pos in positions] for chrom, start, positions, frame, length in zip(all_STRs_unique['chrom'], all_STRs_unique['start'], all_STRs_unique['MM_pos'], all_STRs_unique['repeat_frame_L'], all_STRs_unique['length'])]
all_STRs_unique['MM_pos'] = [pos_list if status == 'inframe' else np.nan for pos_list, status in zip(all_STRs_unique['MM_pos'], all_STRs_unique['status'])]
all_STRs_unique['MM_pred'] = [pred_list if status == 'inframe' else np.nan for pred_list, status in zip(all_STRs_unique['MM_pred'], all_STRs_unique['status'])]
all_STRs_unique['MM_pos_genome'] = [[start+pos for pos in positions] if type(positions) != float else np.nan for start, positions in zip(all_STRs_unique['start'], all_STRs_unique['MM_pos'])]

all_STRs_unique["5'_motif_pos"] = all_STRs_unique['start']
all_STRs_unique["3'_motif_pos"] = all_STRs_unique['end']-1
all_STRs_unique["5'_flank_pos"] = all_STRs_unique['start']-1
all_STRs_unique["3'_flank_pos"] = all_STRs_unique['end']

# Non-interruption positions
all_STRs_unique['motif_pos_genome_middle'] = [[pos for pos in range(start+1,end-1) if pos not in positions] if type(positions) != float else [pos for pos in range(start+1,end-1)] for start, end, positions in zip(all_STRs_unique["5'_motif_pos"], all_STRs_unique["3'_motif_pos"], all_STRs_unique['MM_pos_genome'])]

all_STRs_unique['Strand'] = all_STRs_unique['Strand'].fillna('+')

pos_expand_all['STR'] = pos_expand(all_STRs_unique)
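
In-frame interruptions are found above by tiling the repeat frame to the motif length and comparing it base by base with the observed sequence. The base of the perfect tiling at each mismatch is the predicted reversion. A minimal sketch on a made-up sequence follows. `str_interruptions` is a hypothetical helper, the start coordinate 1000 is invented, and the motif length is assumed to equal `len(sequence)`.

```python
def str_interruptions(sequence, frame):
    """Find positions where `sequence` departs from a perfect tiling of `frame`."""
    length = len(sequence)
    perfect = (frame * length)[:length]  # same construction as in the cell above
    positions = [i for i, (a, b) in enumerate(zip(sequence, perfect)) if a != b]
    predicted = [perfect[i] for i in positions]  # base that would restore the perfect repeat
    return positions, predicted

# An (AC)n repeat with a single T interruption at offset 5.
positions, predicted = str_interruptions("ACACATACAC", "AC")
genome_positions = [1000 + pos for pos in positions]  # made-up motif start of 1000
print(positions, predicted, genome_positions)
```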

#### IR/MR/DR motifs

In [None]:
all_IRs = all_motifs_unique.loc[(all_motifs_unique['Type'] == 'IR')].dropna(axis = 1, how = 'all').copy()

# Find location of interruptions within each IR
all_IRs['MM_pos'] = [np.where([a!=b for a,b in zip(seq1, seq2)]) for seq1, seq2 in zip(all_IRs['seq_L'], all_IRs['RC_seq_R'])]
all_IRs['MM_pos'] = [pos[0] for pos in all_IRs['MM_pos']]
all_IRs['MM_pos_L'] = [[start+pos for pos in positions] for start, positions in zip(all_IRs['start'], all_IRs['MM_pos'])]
all_IRs['MM_pos_R'] = [[end-pos-1 for pos in positions] for end, positions in zip(all_IRs['end'], all_IRs['MM_pos'])]
all_IRs['MM_pred_L'] = [[reverse_complement(reference_genome[chrom][pos]) for pos in positions] for chrom, positions in zip(all_IRs['chrom'], all_IRs['MM_pos_R'])]
all_IRs['MM_pred_R'] = [[reverse_complement(reference_genome[chrom][pos]) for pos in positions] for chrom, positions in zip(all_IRs['chrom'], all_IRs['MM_pos_L'])]

# Spacer position predictions
all_IRs['spacer_pos'] = [list(range(start+length, end-length)) for start, end, length in zip(all_IRs['start'], all_IRs['end'], all_IRs['stem_len'].astype(int))]
all_IRs["spacer_5'_pos"] = [pos[0] if s_len>1 else 0 for pos, s_len in zip(all_IRs['spacer_pos'], all_IRs['spacer'].astype(int))]
all_IRs["spacer_middle_pos"] = [pos[1:-1] if s_len>1 else pos if s_len == 1 else [] for pos, s_len in zip(all_IRs['spacer_pos'], all_IRs['spacer'].astype(int))]
all_IRs["spacer_3'_pos"] = [pos[-1] if s_len>1 else 0 for pos, s_len in zip(all_IRs['spacer_pos'], all_IRs['spacer'].astype(int))]
all_IRs["spacer_5'_pred"] = [reverse_complement(reference_genome[chrom][pos]) if pos > 0 else np.nan for chrom, pos in zip(all_IRs['chrom'], all_IRs["spacer_3'_pos"])]
all_IRs["spacer_3'_pred"] = [reverse_complement(reference_genome[chrom][pos]) if pos > 0 else np.nan  for chrom, pos in zip(all_IRs['chrom'], all_IRs["spacer_5'_pos"])]

# Immediate flank position predictions
all_IRs["5'_motif_pos"] = all_IRs['start']
all_IRs["3'_motif_pos"] = all_IRs['end']-1
all_IRs["5'_flank_pos"] = all_IRs['start']-1
all_IRs["3'_flank_pos"] = all_IRs['end']
all_IRs["5'_flank_pred"] = [reverse_complement(reference_genome[chrom][pos]) for chrom, pos in zip(all_IRs['chrom'], all_IRs["3'_flank_pos"])]
all_IRs["3'_flank_pred"] = [reverse_complement(reference_genome[chrom][pos]) for chrom, pos in zip(all_IRs['chrom'], all_IRs["5'_flank_pos"])]

# Non-interruption positions
all_IRs['motif_pos_genome_middle'] = [[pos for pos in range(start+1,end-1) if pos not in positions] if type(positions) != float else [pos for pos in range(start+1,end-1)] for start, end, positions in zip(all_IRs['start'], all_IRs['end'], all_IRs['MM_pos_L'])]
all_IRs['motif_pos_genome_middle'] = [[pos for pos in motif_pos if pos not in positions] if type(positions) != float else [pos for pos in motif_pos] for motif_pos, positions in zip(all_IRs['motif_pos_genome_middle'], all_IRs['MM_pos_R'])]
all_IRs['motif_pos_genome_middle'] = [[pos for pos in motif_pos if pos not in positions] if type(positions) != float else [pos for pos in motif_pos] for motif_pos, positions in zip(all_IRs['motif_pos_genome_middle'], all_IRs['spacer_pos'])]

# GC content not including spacer
all_IRs['GC%_stem'] = ((all_IRs['seq_L'].str.count('C') + all_IRs['seq_L'].str.count('G') + all_IRs['seq_R'].str.count('C') + all_IRs['seq_R'].str.count('G')) / (all_IRs['stem_len'] *2)).round(2)

pos_expand_all['IR'] = pos_expand(all_IRs)
pos_expand_all['IR'] = pos_expand_all['IR'].loc[pos_expand_all['IR']['pos'] > 0].copy()
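
For inverted repeats, a mismatch at offset `i` of the left arm pairs with genome position `end - i - 1` in the right arm, and the predicted reversion at each position is the reverse complement of its partner base. The sketch below works through one made-up motif. It assumes 0-based coordinates with an exclusive `end`, consistent with `3'_motif_pos = end - 1` above. `ir_interruptions` and this `reverse_complement` are hypothetical stand-ins for the notebook's own helpers.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Reverse complement of a DNA string (hypothetical stand-in for the notebook's helper)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def ir_interruptions(genome, start, end, stem_len):
    """Locate arm mismatches in one inverted repeat and predict the reverting bases."""
    seq_left = genome[start:start + stem_len]
    rc_right = reverse_complement(genome[end - stem_len:end])
    offsets = [i for i, (a, b) in enumerate(zip(seq_left, rc_right)) if a != b]
    pos_left = [start + i for i in offsets]
    pos_right = [end - i - 1 for i in offsets]
    # Each side is predicted to mutate to the complement of its partner base.
    pred_left = [reverse_complement(genome[p]) for p in pos_right]
    pred_right = [reverse_complement(genome[p]) for p in pos_left]
    return pos_left, pos_right, pred_left, pred_right

# Flank TT, left arm GAGCCA (one mismatch), spacer AAA, right arm TGGATC, flank TT.
toy_genome = "TT" + "GAGCCA" + "AAA" + "TGGATC" + "TT"
result = ir_interruptions(toy_genome, start=2, end=17, stem_len=6)
print(result)
```

Changing position 4 to T, or position 14 to C, would each restore a perfect inverted repeat in this toy sequence, which is what the two predictions encode.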

In [None]:
all_MRs = all_motifs_unique.loc[(all_motifs_unique['Type'] == 'MR')].dropna(axis = 1, how = 'all').copy()

# Find location of interruptions within each MR
all_MRs['MM_pos'] = [np.where([a!=b for a,b in zip(seq1, seq2)]) for seq1, seq2 in zip(all_MRs['seq_L'], all_MRs['rev_seq_R'])]
all_MRs['MM_pos'] = [pos[0] for pos in all_MRs['MM_pos']]
all_MRs['MM_pos_L'] = [[start+pos for pos in positions] for start, positions in zip(all_MRs['start'], all_MRs['MM_pos'])]
all_MRs['MM_pos_R'] = [[end-pos-1 for pos in positions] for end, positions in zip(all_MRs['end'], all_MRs['MM_pos'])]
all_MRs['MM_pred_L'] = [[reference_genome[chrom][pos] for pos in positions] for chrom, positions in zip(all_MRs['chrom'], all_MRs['MM_pos_R'])]
all_MRs['MM_pred_R'] = [[reference_genome[chrom][pos] for pos in positions] for chrom, positions in zip(all_MRs['chrom'], all_MRs['MM_pos_L'])]

# Spacer position predictions
all_MRs['spacer_pos'] = [list(range(start+length, end-length)) for start, end, length in zip(all_MRs['start'], all_MRs['end'], all_MRs['stem_len'].astype(int))]
all_MRs["spacer_5'_pos"] = [pos[0] if s_len>1 else 0 for pos, s_len in zip(all_MRs['spacer_pos'], all_MRs['spacer'].astype(int))]
all_MRs["spacer_middle_pos"] = [pos[1:-1] if s_len>0 else [] for pos, s_len in zip(all_MRs['spacer_pos'], all_MRs['spacer'].astype(int))]
all_MRs["spacer_3'_pos"] = [pos[-1] if s_len>1 else 0 for pos, s_len in zip(all_MRs['spacer_pos'], all_MRs['spacer'].astype(int))]
all_MRs["spacer_5'_pred"] = [reference_genome[chrom][pos] if pos > 0 else np.nan for chrom, pos in zip(all_MRs['chrom'], all_MRs["spacer_3'_pos"])]
all_MRs["spacer_3'_pred"] = [reference_genome[chrom][pos] if pos > 0 else np.nan for chrom, pos in zip(all_MRs['chrom'], all_MRs["spacer_5'_pos"])]

# Immediate flank positions
all_MRs["5'_motif_pos"] = all_MRs['start']
all_MRs["3'_motif_pos"] = all_MRs['end']-1
all_MRs["5'_flank_pos"] = all_MRs['start']-1
all_MRs["3'_flank_pos"] = all_MRs['end']
all_MRs["5'_flank_pred"] = [reference_genome[chrom][pos] for chrom, pos in zip(all_MRs['chrom'], all_MRs["3'_flank_pos"])]
all_MRs["3'_flank_pred"] = [reference_genome[chrom][pos] for chrom, pos in zip(all_MRs['chrom'], all_MRs["5'_flank_pos"])]

# Non-interruption positions
all_MRs['motif_pos_genome_middle'] = [[pos for pos in range(start+1,end-1) if pos not in positions] if type(positions) != float else [pos for pos in range(start+1,end-1)] for start, end, positions in zip(all_MRs['start'], all_MRs['end'], all_MRs['MM_pos_L'])]
all_MRs['motif_pos_genome_middle'] = [[pos for pos in motif_pos if pos not in positions] if type(positions) != float else [pos for pos in motif_pos] for motif_pos, positions in zip(all_MRs['motif_pos_genome_middle'], all_MRs['MM_pos_R'])]
all_MRs['motif_pos_genome_middle'] = [[pos for pos in motif_pos if pos not in positions] if type(positions) != float else [pos for pos in motif_pos] for motif_pos, positions in zip(all_MRs['motif_pos_genome_middle'], all_MRs['spacer_pos'])]

# GC content not including spacer
all_MRs['GC%_stem'] = ((all_MRs['seq_L'].str.count('C') + all_MRs['seq_L'].str.count('G') + all_MRs['seq_R'].str.count('C') + all_MRs['seq_R'].str.count('G')) / (all_MRs['stem_len'] *2)).round(2)

# purine content not including spacer, with 0 being 50% purine and 0.5 being homopurine/homopyrimidine, with + strand being more purine on the reference strand
all_MRs['purine'] = (all_MRs['seq_L'].str.count('A') + all_MRs['seq_L'].str.count('G') + all_MRs['seq_R'].str.count('A') + all_MRs['seq_R'].str.count('G')) / (all_MRs['stem_len'] *2)
all_MRs['purine'] = 0.5 - all_MRs['purine']
all_MRs['Strand'] = ['+' if purine > 0 else '-' for purine in all_MRs['purine']]
all_MRs['purine'] = abs(all_MRs['purine'].round(2))

pos_expand_all['MR'] = pos_expand(all_MRs)
pos_expand_all['MR'] = pos_expand_all['MR'].loc[pos_expand_all['MR']['pos'] > 0].copy()

In [None]:
all_DRs = all_motifs_unique.loc[(all_motifs_unique['Type'] == 'DR')].dropna(axis = 1, how = 'all').copy()

# Find location of interruptions within each DR
all_DRs['MM_pos'] = [np.where([a!=b for a,b in zip(seq1, seq2)]) for seq1, seq2 in zip(all_DRs['seq_L'], all_DRs['seq_R'])]
all_DRs['MM_pos'] = [pos[0] for pos in all_DRs['MM_pos']]
all_DRs['MM_pos_L'] = [[start+pos for pos in positions] for start, positions in zip(all_DRs['start'], all_DRs['MM_pos'])]
all_DRs['MM_pos_R'] = [[start+pos+stem+spacer for pos in positions] for start, positions, stem, spacer in zip(all_DRs['start'], all_DRs['MM_pos'], all_DRs['stem_len'].astype(int), all_DRs['spacer'].astype(int))]
all_DRs['MM_pred_L'] = [[reference_genome[chrom][pos] for pos in positions][::-1] for chrom, positions in zip(all_DRs['chrom'], all_DRs['MM_pos_R'])]
all_DRs['MM_pred_R'] = [[reference_genome[chrom][pos] for pos in positions][::-1] for chrom, positions in zip(all_DRs['chrom'], all_DRs['MM_pos_L'])]

# Direct repeat spacer/flank predictions
all_DRs['spacer_pos'] = [list(range(start+length, end-length)) for start, end, length in zip(all_DRs['start'], all_DRs['end'], all_DRs['stem_len'].astype(int))]
all_DRs["spacer_5'_pos"] = [pos[0] if s_len>1 else 0 for pos, s_len in zip(all_DRs['spacer_pos'], all_DRs['spacer'].astype(int))]
all_DRs["spacer_middle_pos"] = [pos[1:-1] if s_len>0 else [] for pos, s_len in zip(all_DRs['spacer_pos'], all_DRs['spacer'].astype(int))]
all_DRs["spacer_3'_pos"] = [pos[-1] if s_len>1 else 0 for pos, s_len in zip(all_DRs['spacer_pos'], all_DRs['spacer'].astype(int))]

all_DRs["5'_motif_pos"] = all_DRs['start']
all_DRs["3'_motif_pos"] = all_DRs['end']-1
all_DRs["5'_flank_pos"] = all_DRs['start']-1
all_DRs["3'_flank_pos"] = all_DRs['end']

all_DRs["5'_flank_pred"] = [reference_genome[chrom][pos] for chrom, pos in zip(all_DRs['chrom'], all_DRs["spacer_3'_pos"])]
all_DRs["3'_flank_pred"] = [reference_genome[chrom][pos] for chrom, pos in zip(all_DRs['chrom'], all_DRs["spacer_5'_pos"])]
all_DRs["spacer_5'_pred"] = [reference_genome[chrom][pos] for chrom, pos in zip(all_DRs['chrom'], all_DRs["3'_flank_pos"])]
all_DRs["spacer_3'_pred"] = [reference_genome[chrom][pos] for chrom, pos in zip(all_DRs['chrom'], all_DRs["5'_flank_pos"])]

# Non-interruption positions
all_DRs['motif_pos_genome_middle'] = [[pos for pos in range(start+1,end-1) if pos not in positions] if type(positions) != float else [pos for pos in range(start+1,end-1)] for start, end, positions in zip(all_DRs['start'], all_DRs['end'], all_DRs['MM_pos_L'])]
all_DRs['motif_pos_genome_middle'] = [[pos for pos in motif_pos if pos not in positions] if type(positions) != float else [pos for pos in motif_pos] for motif_pos, positions in zip(all_DRs['motif_pos_genome_middle'], all_DRs['MM_pos_R'])]
all_DRs['motif_pos_genome_middle'] = [[pos for pos in motif_pos if pos not in positions] if type(positions) != float else [pos for pos in motif_pos] for motif_pos, positions in zip(all_DRs['motif_pos_genome_middle'], all_DRs['spacer_pos'])]

# GC content not including spacer
all_DRs['GC%_stem'] = ((all_DRs['seq_L'].str.count('C') + all_DRs['seq_L'].str.count('G') + all_DRs['seq_R'].str.count('C') + all_DRs['seq_R'].str.count('G')) / (all_DRs['stem_len'] *2)).round(2)

pos_expand_all['DR'] = pos_expand(all_DRs)
pos_expand_all['DR'] = pos_expand_all['DR'].loc[pos_expand_all['DR']['pos'] > 0].copy()
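
The coordinate arithmetic for direct repeats can be illustrated on one toy motif. This is a sketch only: it assumes half-open `[start, end)` coordinates with `end = start + 2 * stem_len + spacer`, and the helper name `direct_repeat_positions` is hypothetical.

```python
def direct_repeat_positions(start, seq_L, seq_R, spacer):
    """Map arm mismatches and the spacer of one direct repeat to genome positions.

    Hypothetical helper; assumes equal-length arms and half-open coordinates.
    """
    stem = len(seq_L)
    end = start + 2 * stem + spacer
    # Offsets within the arm where the left and right copies disagree
    mm = [i for i, (a, b) in enumerate(zip(seq_L, seq_R)) if a != b]
    return {
        'MM_pos_L': [start + i for i in mm],
        # The right arm begins stem + spacer bases after the left arm
        'MM_pos_R': [start + i + stem + spacer for i in mm],
        'spacer_pos': list(range(start + stem, end - stem)),
    }


example = direct_repeat_positions(100, 'ACGTA', 'ACCTA', spacer=3)
print(example)
```

In this example the single mismatch sits at offset 2 of each arm, so it maps to position 102 in the left arm and 110 in the right arm, with the spacer covering 105 to 107.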

### Annotate other positions <a name="mutation_internal_annotate_other"></a>

[Return to Table of Contents](#TOC)

#### G4 positions

In [None]:
all_G4s = all_motifs_unique.loc[(all_motifs_unique['Type'] == 'G4')].dropna(axis = 1, how = 'all').copy()

# Find locations of G-runs and loops
all_G4s['run_positions'] = [[pos.start(0) for pos in re.finditer('GGG', seq, overlapped=True)] if strand == '+' else [pos.start(0) for pos in re.finditer('CCC', seq, overlapped=True)] for seq, strand in zip(all_G4s['Sequence'], all_G4s['Strand'])]
all_G4s['run_positions'] = [sorted(list(set(positions + [pos+1 for pos in positions] + [pos+2 for pos in positions]))) for positions in all_G4s['run_positions']]
all_G4s['loop_positions'] = [[pos for pos in range(len(seq)) if pos not in positions] for seq, positions in zip(all_G4s['Sequence'], all_G4s['run_positions'])]

# Separate G4 runs and loops into middle and edge positions
all_G4s['run_positions_middle'] = [[pos for pos in positions if (pos+1 in positions) & (pos-1 in positions)] if type(positions) != float else np.nan for positions in all_G4s['run_positions']]
all_G4s['loop_positions_middle'] = [[pos for pos in positions if (pos+1 in positions) & (pos-1 in positions)] if type(positions) != float else np.nan for positions in all_G4s['loop_positions']]
all_G4s['run_positions_edge'] = [[pos for pos in positions if pos not in middles] if type(positions) != float else np.nan for positions, middles in zip(all_G4s['run_positions'], all_G4s['run_positions_middle'])]
all_G4s['loop_positions_edge'] = [[pos for pos in positions if pos not in middles] if type(positions) != float else np.nan for positions, middles in zip(all_G4s['loop_positions'], all_G4s['loop_positions_middle'])]

all_G4s['run_positions_middle'] = [[start+pos for pos in positions] if type(positions) != float else np.nan for start, positions in zip(all_G4s['start'], all_G4s['run_positions_middle'])]
all_G4s['loop_positions_middle'] = [[start+pos for pos in positions] if type(positions) != float else np.nan for start, positions in zip(all_G4s['start'], all_G4s['loop_positions_middle'])]
all_G4s['run_positions_edge'] = [[start+pos for pos in positions] if type(positions) != float else np.nan for start, positions in zip(all_G4s['start'], all_G4s['run_positions_edge'])]
all_G4s['loop_positions_edge'] = [[start+pos for pos in positions] if type(positions) != float else np.nan for start, positions in zip(all_G4s['start'], all_G4s['loop_positions_edge'])]

# Immediate flanking positions
all_G4s["5'_flank_pos"] = all_G4s['start']-1
all_G4s["3'_flank_pos"] = all_G4s['end']

pos_expand_all['G4'] = pos_expand(all_G4s)
pos_expand_all['G4']['pred'] = np.nan
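
The run/loop classification above can be traced on a short sequence. The sketch below uses only the standard library: a zero-width lookahead stands in for the `overlapped=True` option used in the cell (an option of the third-party `regex` module, which the notebook is assumed to import as `re`). The helper name `g4_positions` is hypothetical, and positions are offsets within the motif rather than genome coordinates.

```python
import re


def g4_positions(seq, strand='+'):
    """Classify each offset of a G4 motif as run/loop and middle/edge."""
    run = 'GGG' if strand == '+' else 'CCC'
    # A lookahead matches at every start position, so overlapping runs are found
    starts = [m.start() for m in re.finditer('(?=' + run + ')', seq)]
    run_pos = sorted({s + k for s in starts for k in range(3)})
    loop_pos = [p for p in range(len(seq)) if p not in run_pos]

    def split(positions):
        # "Middle" positions have both neighbours in the same class
        middle = [p for p in positions if p + 1 in positions and p - 1 in positions]
        edge = [p for p in positions if p not in middle]
        return middle, edge

    run_middle, run_edge = split(run_pos)
    loop_middle, loop_edge = split(loop_pos)
    return {'run_middle': run_middle, 'run_edge': run_edge,
            'loop_middle': loop_middle, 'loop_edge': loop_edge}


print(g4_positions('GGGTAGGGG'))
```

Note that a two-base loop such as `TA` has no middle positions: both bases border a run and are classified as edges.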

#### Z-DNA positions

In [None]:
all_ZDNAs = all_motifs_unique.loc[(all_motifs_unique['Type'] == 'ZDNA')].dropna(axis = 1, how = 'all').copy()

all_ZDNAs["5'_motif_pos"] = all_ZDNAs['start']
all_ZDNAs["3'_motif_pos"] = all_ZDNAs['end']-1
all_ZDNAs["5'_flank_pos"] = all_ZDNAs['start']-1
all_ZDNAs["3'_flank_pos"] = all_ZDNAs['end']
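# Middle positions exclude the first and last motif bases (half-open [start, end) coordinates, as in the random-sequence cell)
all_ZDNAs['motif_pos_genome_middle'] = [[pos for pos in range(start+1,end-1)] for start, end in zip(all_ZDNAs['start'], all_ZDNAs['end'])]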

ZDNAs_GY = all_ZDNAs.dropna(subset = ['Strand'])
all_ZDNAs['Strand'] = all_ZDNAs['Strand'].fillna('+')

pos_expand_all['ZDNA'] = pos_expand(all_ZDNAs)
pos_expand_all['ZDNA_GY'] = pos_expand(ZDNAs_GY)

pos_expand_all['ZDNA']['pred'] = np.nan
pos_expand_all['ZDNA_GY']['pred'] = np.nan

#### Random positions

In [None]:
random_seq['motif_pos_genome_middle'] = [[pos for pos in range(start+1,end-1)] for start, end in zip(random_seq['start'], random_seq['end'])]
random_seq["5'_motif_pos"] = random_seq['start']
random_seq["3'_motif_pos"] = random_seq['end']-1
random_seq["5'_flank_pos"] = random_seq['start']-1
random_seq["3'_flank_pos"] = random_seq['end']

pos_expand_all['random'] = pos_expand(random_seq)

pos_expand_all['random']['pred'] = np.nan
pos_expand_all['random']['Type'] = 'random'
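
The five position categories assigned to each interval can be summarized in a few lines. This is a sketch under the assumption that `start`/`end` are 0-based, half-open coordinates; the helper name `annotate_interval` is hypothetical.

```python
def annotate_interval(start, end):
    """Split a half-open interval [start, end) into flank, edge and middle positions."""
    return {
        "5'_flank_pos": start - 1,        # base immediately before the motif
        "5'_motif_pos": start,            # first base of the motif
        'motif_pos_genome_middle': list(range(start + 1, end - 1)),
        "3'_motif_pos": end - 1,          # last base of the motif
        "3'_flank_pos": end,              # base immediately after the motif
    }


print(annotate_interval(10, 15))
```

For the interval `[10, 15)` this gives flanks at 9 and 15, edges at 10 and 14, and middle positions 11 to 13, so every base of the motif falls in exactly one category.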

### Combine all positions into database <a name="mutation_internal_annotate_combine"></a>
- one line per position

[Return to Table of Contents](#TOC)

In [None]:
# Save temporary output of the motif positions
with open('./analysis/temp/motif_internal_positions_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(pos_expand_all, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Load temporary output of the motif positions
with open('./analysis/temp/motif_internal_positions_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    pos_expand_all = pickle.load(handle)

## Calculate mutation frequency at specified internal positions <a name="mutation_internal_count"></a>

[Return to Table of Contents](#TOC)

### Functions to count and analyze mutations within motifs <a name="mutation_internal_count_functions"></a>


In [None]:
def count_internal_mut_pos_chrom(chrom, pos_current, input_mut_dict, useful_cols, qc_cutoff_list, noAC, gc_nmer, pos_col):
    # for each qc_cutoff in the mutation dataset, find mutations overlapping the search coordinates
    current_mut_sum = dict()

    current_mut_chrom = input_mut_dict[chrom].copy()
    if noAC == True:
        current_mut_chrom[['A', 'T', 'G', 'C']] = (current_mut_chrom[['A', 'T', 'G', 'C']] > 0).astype(int)  
    for qc_cutoff in qc_cutoff_list:
        current_mut_sum[qc_cutoff[0]] = dict()
        current_mut_qc = (current_mut_chrom[['A', 'T', 'G', 'C']]).mul((current_mut_chrom[['qual_A', 'qual_T', 'qual_G', 'qual_C']] >= qc_cutoff[0]).astype(int).values, axis=0).mul((current_mut_chrom[['inbr_A', 'inbr_T', 'inbr_G', 'inbr_C']] >= qc_cutoff[1]).astype(int).values, axis=0)
        current_mut_qc.index = current_mut_qc.index - 1    # convert gnomAD positions from 1-based to 0-based coordinates

        # non-pred positions
        pos_current_other = pd.DataFrame(pos_current.loc[pos_current['pred'].isna()].groupby([pos_col] + useful_cols).count()['Type'])
        pos_current_other[['A', 'T', 'G', 'C']] = 1
        
        if len(pos_current['pred'].dropna()) > 0:

            # predicted positions
            pred_current = pos_current.dropna(subset = ['pred']).groupby([pos_col] + useful_cols +  ['pred']).count()['Type'].unstack().fillna(0)

            # predicted mutations
            current_mut_qc_pred = current_mut_qc.reindex(pred_current.index.get_level_values(pos_col)).mul((pred_current >0)[['A', 'T', 'G', 'C']]).fillna(0).reset_index().set_index([pos_col])
            current_mut_qc_pred['Tri'] = [tri_function(chrom, pos, base = 0) for pos in current_mut_qc_pred.index]
            current_mut_qc_pred['tri_count'] = 1   
            current_mut_qc_pred['seq_'+str(gc_nmer)] = [reference_lookup(chrom, pos, round((gc_nmer-1)/2)) for pos in current_mut_qc_pred.index]
            current_mut_qc_pred['GC_'+str(gc_nmer)] = (current_mut_qc_pred['seq_'+str(gc_nmer)].str.count('G') + current_mut_qc_pred['seq_'+str(gc_nmer)].str.count('C')) / (gc_nmer - current_mut_qc_pred['seq_'+str(gc_nmer)].str.count('N'))
            current_mut_sum[qc_cutoff[0]]['pred'] = current_mut_qc_pred.loc[current_mut_qc_pred['Tri'].isin(all_triplets)].groupby(useful_cols + ['Tri']).sum().copy()

            #return current_mut_qc_pred
            
            # against prediction
            current_mut_qc_againstpred = current_mut_qc.reindex(pred_current.index.get_level_values(pos_col)).mul((pred_current == 0)[['A', 'T', 'G', 'C']]).fillna(0).reset_index().set_index([pos_col])
            current_mut_qc_againstpred['Tri'] = [tri_function(chrom, pos, base = 0) for pos in current_mut_qc_againstpred.index]
            current_mut_qc_againstpred['tri_count'] = 1   
            current_mut_qc_againstpred['seq_'+str(gc_nmer)] = [reference_lookup(chrom, pos, round((gc_nmer-1)/2)) for pos in current_mut_qc_againstpred.index]
            current_mut_qc_againstpred['GC_'+str(gc_nmer)] = (current_mut_qc_againstpred['seq_'+str(gc_nmer)].str.count('G') + current_mut_qc_againstpred['seq_'+str(gc_nmer)].str.count('C')) / (gc_nmer - current_mut_qc_againstpred['seq_'+str(gc_nmer)].str.count('N'))
            current_mut_sum[qc_cutoff[0]]['against_pred'] = current_mut_qc_againstpred.loc[current_mut_qc_againstpred['Tri'].isin(all_triplets)].groupby(useful_cols + ['Tri']).sum().copy()

        # non-predicted positions
        current_mut_qc_nonpredicted = current_mut_qc.reindex(pos_current_other.index.get_level_values(pos_col)).mul(pos_current_other[['A', 'T', 'G', 'C']]).fillna(0).reset_index().set_index([pos_col])       
        current_mut_qc_nonpredicted['Tri'] = [tri_function(chrom, pos, base = 0) for pos in current_mut_qc_nonpredicted.index]
        current_mut_qc_nonpredicted['tri_count'] = 1   
        current_mut_qc_nonpredicted['seq_'+str(gc_nmer)] = [reference_lookup(chrom, pos, round((gc_nmer-1)/2)) for pos in current_mut_qc_nonpredicted.index]
        current_mut_qc_nonpredicted['GC_'+str(gc_nmer)] = (current_mut_qc_nonpredicted['seq_'+str(gc_nmer)].str.count('G') + current_mut_qc_nonpredicted['seq_'+str(gc_nmer)].str.count('C')) / (gc_nmer - current_mut_qc_nonpredicted['seq_'+str(gc_nmer)].str.count('N'))
        current_mut_sum[qc_cutoff[0]]['non_pred'] = current_mut_qc_nonpredicted.loc[current_mut_qc_nonpredicted['Tri'].isin(all_triplets)].groupby(useful_cols + ['Tri']).sum().copy()

        current_mut_sum[qc_cutoff[0]] = pd.concat(current_mut_sum[qc_cutoff[0]])
        current_mut_sum[qc_cutoff[0]].index.rename('prediction', level = 0, inplace = True)
    return current_mut_sum
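
The core of the function above is a per-allele QC mask followed by a split into mutations that match or contradict the predicted base. The sketch below reproduces that logic in plain Python for a single site. It is a simplification: it assumes one predicted base per position, and the names `qc_filtered_counts`, `split_by_prediction`, and the `count`/`qual`/`inbr` dictionary layout are hypothetical stand-ins for the DataFrame columns.

```python
from math import nan

BASES = ('A', 'T', 'G', 'C')


def qc_filtered_counts(site, qual_cutoff, inbr_cutoff):
    """Zero out alternate-allele counts that fail either QC threshold.

    Missing QC values (nan) fail every comparison, so those alleles are dropped.
    """
    return {b: site['count'][b]
            if site['qual'][b] >= qual_cutoff and site['inbr'][b] >= inbr_cutoff
            else 0
            for b in BASES}


def split_by_prediction(counts, predicted_base):
    """Separate counts into those matching and those opposing the predicted base."""
    pred = {b: counts[b] if b == predicted_base else 0 for b in BASES}
    against = {b: counts[b] if b != predicted_base else 0 for b in BASES}
    return pred, against


# One toy site, keyed by its 1-based position
sites_1based = {
    101: {'count': {'A': 3, 'T': 0, 'G': 1, 'C': 0},
          'qual': {'A': 5.0, 'T': nan, 'G': -1.0, 'C': nan},
          'inbr': {'A': 0.1, 'T': nan, 'G': 0.0, 'C': nan}},
}

# Shift to 0-based positions while applying a (quality, inbreeding) cutoff pair
strict = {pos - 1: qc_filtered_counts(s, 0, -0.3) for pos, s in sites_1based.items()}
lenient = {pos - 1: qc_filtered_counts(s, -2.774, -0.3) for pos, s in sites_1based.items()}

print(split_by_prediction(strict[100], 'A'))
print(split_by_prediction(lenient[100], 'A'))
```

Under the stricter cutoff the G allele is masked, so only the predicted A mutation is counted; under the lenient cutoff the G allele survives and is tallied against the prediction.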

In [None]:
# note: removed several astype(int) commands for compatibility with non-integer mutation counts (due to AC correction factor)

In [None]:
def count_internal_mut_pos(input_pos_df, input_mut_dict = gnomad_slim_all, qc_cutoff_list = [(-np.inf, -np.inf), (-2.774, -0.3), (0, -0.3), (4, -0.3)], pos_col = 'pos', chrom_col = 'chrom', strand_col = 'Strand', strand_names = ('+', '-'), useful_cols = [], noAC = False, gc_correction_dict = gc_correction_bytri, gc_nmer = False, combine_LR = False):
   
    current_mut_sum_chrom = dict()
    for chrom in range(chr_range,23):
        current_mut_sum_chrom[chrom] = count_internal_mut_pos_chrom(chrom, input_pos_df.loc[input_pos_df[chrom_col] == chrom].copy(), input_mut_dict, pos_col = pos_col, useful_cols = useful_cols, qc_cutoff_list = qc_cutoff_list, noAC = noAC, gc_nmer = gc_nmer)
        print('finished chr' + str(chrom) + '    ', end="\r", flush=True)

    current_mut_sum = dict()
    for qc_cutoff in qc_cutoff_list:
        current_mut_sum[qc_cutoff[0]] = pd.concat([current_mut_sum_chrom[chrom][qc_cutoff[0]] for chrom in range(chr_range,23)])
    
    useful_cols = ['prediction'] + useful_cols
    
    # apply reverse complement to - strand triplets, mutation counts and positions
    if strand_col in useful_cols:
        useful_cols.remove(strand_col)
        internal_position_categories = pd.Series(['MM_pos', 'MM_pred', 'MM_pos_genome', "3'_motif_pos", "5'_motif_pos", "3'_flank_pos", "5'_flank_pos", 'motif_pos_genome_middle', 'MM_pos_R', 'MM_pos_L', 'MM_pred_R', 'MM_pred_L', 'spacer_pos', "spacer_3'_pos", "spacer_middle_pos", "spacer_5'_pos", "spacer_3'_pred", "spacer_5'_pred", "3'_flank_pred", "5'_flank_pred", 'run_positions', 'loop_positions', 'run_positions_middle', 'loop_positions_middle', 'run_positions_edge', 'loop_positions_edge'], index = ['MM_pos', 'MM_pred', 'MM_pos_genome', "5'_motif_pos", "3'_motif_pos", "5'_flank_pos", "3'_flank_pos", 'motif_pos_genome_middle', 'MM_pos_L', 'MM_pos_R', 'MM_pred_L', 'MM_pred_R', 'spacer_pos', "spacer_5'_pos", "spacer_middle_pos", "spacer_3'_pos", "spacer_5'_pred", "spacer_3'_pred", "5'_flank_pred", "3'_flank_pred", 'run_positions', 'loop_positions', 'run_positions_middle', 'loop_positions_middle', 'run_positions_edge', 'loop_positions_edge'])

        current_mut_sum_strand_F = dict()
        current_mut_sum_strand_R = dict()
        current_mut_sum_bothstrands = dict()
        for qc_cutoff in current_mut_sum:
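            # split by strand using the strand_col/strand_names arguments; copy so the '-' strand rows can be modified safely
            current_mut_sum_strand_F[qc_cutoff] = current_mut_sum[qc_cutoff].reset_index().loc[current_mut_sum[qc_cutoff].reset_index()[strand_col] == strand_names[0]].copy()
            current_mut_sum_strand_R[qc_cutoff] = current_mut_sum[qc_cutoff].reset_index().loc[current_mut_sum[qc_cutoff].reset_index()[strand_col] == strand_names[1]].copy()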
            current_mut_sum_strand_R[qc_cutoff]['category'] = internal_position_categories.reindex(current_mut_sum_strand_R[qc_cutoff]['category']).values
            if 'repeat' in useful_cols:
                current_mut_sum_strand_R[qc_cutoff]['repeat'] = current_mut_sum_strand_R[qc_cutoff]['repeat'].apply(reverse_complement)
            current_mut_sum_strand_R[qc_cutoff]['Tri'] = current_mut_sum_strand_R[qc_cutoff]['Tri'].apply(reverse_complement)
            current_mut_sum_strand_R[qc_cutoff][['A', 'T', 'G', 'C']] = current_mut_sum_strand_R[qc_cutoff][['T', 'A', 'C', 'G']].values
            current_mut_sum_bothstrands[qc_cutoff] = pd.concat([current_mut_sum_strand_F[qc_cutoff], current_mut_sum_strand_R[qc_cutoff]]).groupby(useful_cols + ['Tri']).sum()#.fillna(0)
    else:
        current_mut_sum_bothstrands = dict()
        for qc_cutoff in current_mut_sum:
            current_mut_sum_bothstrands[qc_cutoff] = current_mut_sum[qc_cutoff].reset_index().groupby(useful_cols + ['Tri']).sum()#.fillna(0)
    
    if combine_LR == True:
        for qc_cutoff in current_mut_sum:
            current_mut_sum_bothstrands[qc_cutoff] = current_mut_sum_bothstrands[qc_cutoff].reset_index()
            current_mut_sum_bothstrands[qc_cutoff]['category'] = current_mut_sum_bothstrands[qc_cutoff]['category'].str.replace("3'_", '').str.replace("5'_", '').str.replace('_L', '').str.replace('_R', '')
            current_mut_sum_bothstrands[qc_cutoff] = current_mut_sum_bothstrands[qc_cutoff].groupby(useful_cols + ['Tri']).sum()

    if gc_nmer != False:
        for qc_cutoff in current_mut_sum:
            current_mut_sum_bothstrands[qc_cutoff]['GC_'+str(gc_nmer)] = (current_mut_sum_bothstrands[qc_cutoff]['GC_'+str(gc_nmer)] / current_mut_sum_bothstrands[qc_cutoff]['tri_count']).round(3)
    
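    # Reformat output to NNN_N rows x position columns, and split mutation counts from trinucleotide counts.
    # tri_count is one per annotated position and does not depend on the QC filter,
    # so the last qc_cutoff left over from the loops above is reused here.
    current_tri_sum = current_mut_sum_bothstrands[qc_cutoff].reset_index().groupby(useful_cols + ['Tri']).sum()['tri_count'].unstack().transpose().fillna(0).astype(int)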

    current_mut_sum_reformat = dict()
    if gc_nmer != False:
        useful_cols = useful_cols + ['GC_'+str(gc_nmer)]
    for qc_cutoff in current_mut_sum:
        current_mut_sum_reformat[qc_cutoff] = current_mut_sum_bothstrands[qc_cutoff].reset_index().groupby(useful_cols + ['Tri']).sum()[['A', 'T', 'G', 'C']].unstack().transpose().fillna(0)
        current_mut_sum_reformat[qc_cutoff].index = current_mut_sum_reformat[qc_cutoff].index.get_level_values('Tri') + '_' + current_mut_sum_reformat[qc_cutoff].index.get_level_values(0)
        current_mut_sum_reformat[qc_cutoff] = current_mut_sum_reformat[qc_cutoff].reindex(triplet_mutations_und)
        current_mut_sum_reformat[qc_cutoff].index.name = 'Mut' 
    
    if gc_nmer == False:
        return current_mut_sum_reformat.copy(), current_tri_sum.copy()
    
    # GC window correction
    current_mut_sum_GCcorrect = dict()
    for qc_cutoff in current_mut_sum:
        current_mut_sum_GCcorrect[qc_cutoff] = current_mut_sum_reformat[qc_cutoff].mul((gc_correction_dict[qc_cutoff].iloc[np.searchsorted(gc_correction_dict[qc_cutoff].index, current_mut_sum_reformat[qc_cutoff].columns.get_level_values('GC_'+str(gc_nmer)))]).transpose().values)
        current_mut_sum_GCcorrect[qc_cutoff] = current_mut_sum_GCcorrect[qc_cutoff].transpose().reset_index().groupby(useful_cols[:-1]).sum().transpose().reindex(triplet_mutations_und)
        current_mut_sum_reformat[qc_cutoff] = current_mut_sum_reformat[qc_cutoff].transpose().reset_index().groupby(useful_cols[:-1]).sum().transpose().reindex(triplet_mutations_und)

    return current_mut_sum_reformat.copy(), current_tri_sum.copy(), current_mut_sum_GCcorrect.copy()
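
Two label manipulations in the function above are easy to misread: folding `-` strand rows onto the `+` strand, and the `combine_LR` merge of left/right categories. The sketch below shows both on single values. The names `fold_minus_strand` and `combine_lr` are hypothetical, `SWAP` holds only a subset of the category map, and `reverse_complement` here is a local stand-in for the notebook's own helper of the same name.

```python
COMPLEMENT = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

# Subset of the 5'/3' and L/R label swaps applied to '-' strand motifs
SWAP = {
    "5'_motif_pos": "3'_motif_pos", "3'_motif_pos": "5'_motif_pos",
    "5'_flank_pos": "3'_flank_pos", "3'_flank_pos": "5'_flank_pos",
    'MM_pos_L': 'MM_pos_R', 'MM_pos_R': 'MM_pos_L',
}


def reverse_complement(seq):
    """Local stand-in: reverse the sequence and complement each base."""
    return ''.join(COMPLEMENT[b] for b in reversed(seq))


def fold_minus_strand(tri, counts, category):
    """Express a '-' strand row in '+' strand terms."""
    # A mutation to base b on one strand is a mutation to its complement on the other
    folded = {COMPLEMENT[b]: n for b, n in counts.items()}
    return reverse_complement(tri), folded, SWAP.get(category, category)


def combine_lr(category):
    """Strip 5'/3' and L/R markers so mirrored categories share one label."""
    for token in ("3'_", "5'_", '_L', '_R'):
        category = category.replace(token, '')
    return category


print(fold_minus_strand('ACG', {'A': 1, 'T': 2, 'G': 3, 'C': 4}, "5'_flank_pos"))
print(combine_lr("spacer_5'_pos"), combine_lr('MM_pos_L'))
```

Symmetric categories such as `run_positions` are left unchanged by the swap, which is why the full map in the function lists them with identical keys and values.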

In [None]:
def count_internal_resum(input_count_dict, useful_cols, tri_subset = triplet_mutations_und, selections = False, gc_correct = True):
    output_dict = dict()
    output_dict[0] = dict()
    for qc_cutoff in input_count_dict[0]:
        output_dict[0][qc_cutoff] = input_count_dict[0][qc_cutoff].copy().transpose().reset_index().groupby(useful_cols).sum().transpose().reindex(tri_subset)
    output_dict[1] = input_count_dict[1].copy().transpose().reset_index().groupby(useful_cols).sum().transpose().reindex(all_triplets)
    if gc_correct == True:
        output_dict[2] = dict()
        for qc_cutoff in input_count_dict[2]:
            output_dict[2][qc_cutoff] = input_count_dict[2][qc_cutoff].copy().transpose().reset_index().groupby(useful_cols).sum().transpose().reindex(tri_subset)    
    if selections != False:
        for qc_cutoff in input_count_dict[0]:
            output_dict[0][qc_cutoff] = output_dict[0][qc_cutoff].transpose().reset_index()
            if gc_correct == True:
                output_dict[2][qc_cutoff] = output_dict[2][qc_cutoff].transpose().reset_index()
        output_dict[1] = output_dict[1].transpose().reset_index()
        for col, choice in zip(useful_cols, selections):
            if choice != False:
                if type(choice) == list:
                    for qc_cutoff in input_count_dict[0]:
                        output_dict[0][qc_cutoff] =  output_dict[0][qc_cutoff].loc[output_dict[0][qc_cutoff][col].isin(choice)]
                        if gc_correct == True:
                            output_dict[2][qc_cutoff] =  output_dict[2][qc_cutoff].loc[output_dict[2][qc_cutoff][col].isin(choice)]
                    output_dict[1] =  output_dict[1].loc[output_dict[1][col].isin(choice)]
                else:
                    for qc_cutoff in input_count_dict[0]:
                        output_dict[0][qc_cutoff] =  output_dict[0][qc_cutoff].loc[output_dict[0][qc_cutoff][col] == choice]
                        if gc_correct == True:
                            output_dict[2][qc_cutoff] =  output_dict[2][qc_cutoff].loc[output_dict[2][qc_cutoff][col] == choice]
                    output_dict[1] =  output_dict[1].loc[output_dict[1][col] == choice]
        for qc_cutoff in input_count_dict[0]:
            output_dict[0][qc_cutoff] = output_dict[0][qc_cutoff].groupby(useful_cols).sum().transpose().reindex(tri_subset)
            if gc_correct == True:
                output_dict[2][qc_cutoff] = output_dict[2][qc_cutoff].groupby(useful_cols).sum().transpose().reindex(tri_subset)
        output_dict[1] = output_dict[1].groupby(useful_cols).sum().transpose().reindex(all_triplets)
    return output_dict
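
Conceptually, `count_internal_resum` optionally filters the count tables on some grouping columns and then re-sums over a smaller set of columns. The plain-Python sketch below shows that idea on a dictionary of counts. It is not the notebook's implementation: the name `resum` is hypothetical, and selections are passed as a dictionary rather than as a list aligned with `useful_cols`.

```python
from collections import defaultdict


def resum(counts, columns, keep, selections=None):
    """Filter counts on selected columns, then sum over all columns not in `keep`.

    counts: {tuple of column values: number}, with values ordered as in `columns`.
    selections: {column: allowed value, or list of allowed values}.
    """
    out = defaultdict(int)
    for key, n in counts.items():
        row = dict(zip(columns, key))
        rejected = any(
            (row[col] not in choice) if isinstance(choice, list) else (row[col] != choice)
            for col, choice in (selections or {}).items()
        )
        if rejected:
            continue
        out[tuple(row[col] for col in keep)] += n
    return dict(out)


columns = ('category', 'stem_len', 'spacer')
counts = {
    ('MM_pos', 5, 2): 4,
    ('MM_pos', 5, 12): 1,
    ('MM_pos', 6, 3): 2,
    ('flank_pos', 5, 0): 7,
}

# Keep spacers of 0-3 nt, then collapse the spacer column
print(resum(counts, columns, keep=('category', 'stem_len'),
            selections={'spacer': list(range(4))}))
```

The same two-step pattern (select, then collapse) is what the spacer-restricted analyses further down rely on.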

### Count and analyze <a name="mutation_internal_count_analyze"></a>

[Return to Table of Contents](#TOC)

In [None]:
count_internal_all = dict()
norm_internal_all = dict()

#### Random

In [None]:
count_internal_all['random'] = count_internal_mut_pos(pos_expand_all['random'].copy(), useful_cols = ['category'], gc_nmer = 51, combine_LR = True)

norm_internal_all['random'] = mut_norm_conf(count_internal_resum(count_internal_all['random'], ['prediction', 'category']) , min_count = 0, normtorandom = False, gc_correct=True)

normtorandom_internal = pd.Series([norm_internal_all['random'][0][qc_filter]['non_pred']['motif_pos'] for qc_filter in vqslod_list], index = vqslod_list)

norm_internal_all['random_norm'] = mut_norm_conf(count_internal_resum(count_internal_all['random'], ['prediction', 'category']) , min_count = 0, normtorandom = True, random_normaverage = normtorandom_all, gc_correct=True)
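
The `gc_nmer = 51` argument used here sets the window for the local GC fraction: 25 bases on either side of the position, with `N` bases removed from the denominator. A minimal sketch of that calculation follows, assuming the window lies fully inside the sequence and contains at least one non-`N` base; the name `gc_fraction` is hypothetical and a plain string stands in for the reference genome lookup.

```python
def gc_fraction(sequence, pos, gc_nmer=51):
    """GC fraction in a window of gc_nmer bases centred on pos (0-based)."""
    half = round((gc_nmer - 1) / 2)
    window = sequence[pos - half:pos + half + 1]
    # Ambiguous bases do not count toward the window size
    called = gc_nmer - window.count('N')
    return (window.count('G') + window.count('C')) / called


toy = 'AAGCNGCTTA'
print(gc_fraction(toy, 4, gc_nmer=5))  # window 'GCNGC'
print(gc_fraction(toy, 2, gc_nmer=5))  # window 'AAGCN'
```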

#### STR motifs <a name="mutation_internal_count_analyze_STR"></a>

[Return to Table of Contents](#TOC)

In [None]:
count_internal_all['STR'] = count_internal_mut_pos(pos_expand_all['STR'].copy(), useful_cols = ['Strand', 'repeat', 'category', 'status', 'length'], gc_nmer = 51, combine_LR = True)

norm_internal_all['STR'] = mut_norm_conf(count_internal_all['STR'], min_count = 10, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)

#### Inverted/Mirror/Direct motifs <a name="mutation_internal_count_analyze_IRDMRDR"></a>

[Return to Table of Contents](#TOC)

In [None]:
count_internal_all['IR'] = count_internal_mut_pos(pos_expand_all['IR'].copy(), useful_cols = ['category', '#MM', 'GC%_stem', 'stem_len', 'spacer'], gc_nmer = 51, combine_LR = True)

norm_internal_all['IR'] = mut_norm_conf(count_internal_resum(count_internal_all['IR'], ['prediction', 'category', '#MM', 'stem_len', 'spacer']) , min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)

# Group normalized values by certain data columns
norm_internal_all['IR_stemlen_byspacer'] = mut_norm_conf(count_internal_resum(count_internal_all['IR'], ['prediction', 'category', '#MM', 'spacer', 'stem_len']) , min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_all['IR_stemlen'] = mut_norm_conf(count_internal_resum(count_internal_all['IR'], ['prediction', 'category', '#MM', 'stem_len']) , min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_all['IR_spacer'] = mut_norm_conf(count_internal_resum(count_internal_all['IR'], ['prediction', 'category', '#MM', 'spacer']) , min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_all['IR_GC'] = mut_norm_conf(count_internal_resum(count_internal_all['IR'], ['prediction', 'category', '#MM', 'GC%_stem']) , min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)

# IRs with spacer length requirements

count_internal_IR_spacer10 = count_internal_resum(count_internal_all['IR'], ['prediction', 'category', '#MM', 'stem_len', 'spacer'], selections=[False, False, 1, False, list(range(11))])
count_internal_IR_spacer10 = count_internal_resum(count_internal_IR_spacer10, ['prediction', 'category', 'stem_len'])

norm_internal_all['IR_stemlen_spacer10'] = mut_norm_conf(count_internal_IR_spacer10 , min_count = 10, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)

count_internal_IR_spacer3 = count_internal_resum(count_internal_all['IR'], ['prediction', 'category', '#MM', 'stem_len', 'spacer'], selections=[False, False, 1, False, list(range(4))])
count_internal_IR_spacer3 = count_internal_resum(count_internal_IR_spacer3, ['prediction', 'category', 'stem_len'])

count_internal_IR_spacer410 = count_internal_resum(count_internal_all['IR'], ['prediction', 'category', '#MM', 'stem_len', 'spacer'], selections=[False, False, 1, False, list(range(4,11))])
count_internal_IR_spacer410 = count_internal_resum(count_internal_IR_spacer410, ['prediction', 'category', 'stem_len'])

count_internal_IR_spacer_over10 = count_internal_resum(count_internal_all['IR'], ['prediction', 'category', '#MM', 'stem_len', 'spacer'], selections=[False, False, 1, False, list(range(11,100))])
count_internal_IR_spacer_over10 = count_internal_resum(count_internal_IR_spacer_over10, ['prediction', 'category', 'stem_len'])

norm_internal_all['IR_stemlen_spacer3'] = mut_norm_conf(count_internal_IR_spacer3 , min_count = 20, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_all['IR_stemlen_spacer410'] = mut_norm_conf(count_internal_IR_spacer410 , min_count = 20, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_all['IR_stemlen_spacer_over10'] = mut_norm_conf(count_internal_IR_spacer_over10 , min_count = 20, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)

In [None]:
count_internal_all['DR'] = count_internal_mut_pos(pos_expand_all['DR'].copy(), useful_cols = ['category', '#MM', 'GC%_stem', 'stem_len', 'spacer'], gc_nmer = 51, combine_LR = True)

# Group normalized values by certain data columns
norm_internal_all['DR_stemlen'] = mut_norm_conf(count_internal_resum(count_internal_all['DR'], ['prediction', 'category', '#MM', 'stem_len']) , min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_all['DR_spacer'] = mut_norm_conf(count_internal_resum(count_internal_all['DR'], ['prediction', 'category', '#MM', 'spacer']) , min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_all['DR_GC'] = mut_norm_conf(count_internal_resum(count_internal_all['DR'], ['prediction', 'category', '#MM', 'GC%_stem']) , min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)

# DRs with spacer length requirements

count_internal_DR_spacer3 = count_internal_resum(count_internal_all['DR'], ['prediction', 'category', '#MM', 'stem_len', 'spacer'], selections=[False, False, 1, False, list(range(4))])
count_internal_DR_spacer3 = count_internal_resum(count_internal_DR_spacer3, ['prediction', 'category', 'stem_len'])

count_internal_DR_spacer410 = count_internal_resum(count_internal_all['DR'], ['prediction', 'category', '#MM', 'stem_len', 'spacer'], selections=[False, False, 1, False, list(range(4,11))])
count_internal_DR_spacer410 = count_internal_resum(count_internal_DR_spacer410, ['prediction', 'category', 'stem_len'])

count_internal_DR_spacer_over10 = count_internal_resum(count_internal_all['DR'], ['prediction', 'category', '#MM', 'stem_len', 'spacer'], selections=[False, False, 1, False, list(range(11,100))])
count_internal_DR_spacer_over10 = count_internal_resum(count_internal_DR_spacer_over10, ['prediction', 'category', 'stem_len'])

norm_internal_all['DR_stemlen_spacer3'] = mut_norm_conf(count_internal_DR_spacer3 , min_count = 20, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_all['DR_stemlen_spacer410'] = mut_norm_conf(count_internal_DR_spacer410 , min_count = 20, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_all['DR_stemlen_spacer_over10'] = mut_norm_conf(count_internal_DR_spacer_over10 , min_count = 20, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)

In [None]:
count_internal_all['MR'] = count_internal_mut_pos(pos_expand_all['MR'].copy(), useful_cols = ['category', '#MM', 'GC%_stem', 'purine', 'stem_len', 'spacer'], gc_nmer = 51, combine_LR = True)

# Group normalized values by certain data columns
norm_internal_all['MR_stemlen'] = mut_norm_conf(count_internal_resum(count_internal_all['MR'], ['prediction', 'category', '#MM', 'stem_len']) , min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_all['MR_spacer'] = mut_norm_conf(count_internal_resum(count_internal_all['MR'], ['prediction', 'category', '#MM', 'spacer']) , min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_all['MR_GC'] = mut_norm_conf(count_internal_resum(count_internal_all['MR'], ['prediction', 'category', '#MM', 'GC%_stem']) , min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_all['MR_purine'] = mut_norm_conf(count_internal_resum(count_internal_all['MR'], ['prediction', 'category', '#MM', 'purine']) , min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)

# MRs with spacer length requirements
count_internal_MR_spacer3 = count_internal_resum(count_internal_all['MR'], ['prediction', 'category', '#MM', 'stem_len', 'spacer'], selections=[False, False, 1, False, list(range(4))])
count_internal_MR_spacer3 = count_internal_resum(count_internal_MR_spacer3, ['prediction', 'category', 'stem_len'])

count_internal_MR_spacer410 = count_internal_resum(count_internal_all['MR'], ['prediction', 'category', '#MM', 'stem_len', 'spacer'], selections=[False, False, 1, False, list(range(4,11))])
count_internal_MR_spacer410 = count_internal_resum(count_internal_MR_spacer410, ['prediction', 'category', 'stem_len'])

count_internal_MR_spacer_over10 = count_internal_resum(count_internal_all['MR'], ['prediction', 'category', '#MM', 'stem_len', 'spacer'], selections=[False, False, 1, False, list(range(11,100))])
count_internal_MR_spacer_over10 = count_internal_resum(count_internal_MR_spacer_over10, ['prediction', 'category', 'stem_len'])

norm_internal_all['MR_stemlen_spacer3'] = mut_norm_conf(count_internal_MR_spacer3 , min_count = 20, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_all['MR_stemlen_spacer410'] = mut_norm_conf(count_internal_MR_spacer410 , min_count = 20, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_all['MR_stemlen_spacer_over10'] = mut_norm_conf(count_internal_MR_spacer_over10 , min_count = 20, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
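
The IR, DR and MR cells above all use the same three spacer classes, defined by the `range` objects passed to `selections`. The sketch below makes the bin boundaries explicit; the names `SPACER_BINS` and `spacer_bin` are hypothetical.

```python
# Same ranges as the selections lists above: 0-3 nt, 4-10 nt, and 11-99 nt
SPACER_BINS = {
    'spacer3': range(4),
    'spacer410': range(4, 11),
    'spacer_over10': range(11, 100),
}


def spacer_bin(spacer):
    """Return the name of the spacer class containing this spacer length, if any."""
    for name, allowed in SPACER_BINS.items():
        if spacer in allowed:
            return name
    return None


for length in (0, 3, 4, 10, 11, 99, 100):
    print(length, spacer_bin(length))
```

Spacers of 100 nt or more fall outside all three classes, because the upper range stops at 99.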

#### G4 motifs <a name="mutation_internal_count_analyze_G4"></a>

[Return to Table of Contents](#TOC)

In [None]:
count_internal_all['G4'] = count_internal_mut_pos(pos_expand_all['G4'].copy(), useful_cols = ['Strand', 'status', 'category', 'length'], gc_nmer = 51, combine_LR = True)

# Select G4s appearing in K+ conditions
count_internal_all['G4_K+'] = count_internal_resum(count_internal_all['G4'], ['status', 'category'], selections = [['K+', 'both'], False])
count_internal_all['G4_K+'] = count_internal_resum(count_internal_all['G4_K+'], ['category'])

In [None]:
# Separate 5' and 3' edge positions based on the trinucleotide (the position of G within the triplet determines the side)
norm_internal_G4_positions = dict()
norm_internal_G4_positions["run 5'"] = mut_norm_conf(count_internal_resum(count_internal_all['G4_K+'], ['category'], selections = ['run_positions_edge']), tri_subset = [mut for mut in triplet_mutations_und if (mut[0] != 'G')], min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_G4_positions["run 3'"] = mut_norm_conf(count_internal_resum(count_internal_all['G4_K+'], ['category'], selections = ['run_positions_edge']), tri_subset = [mut for mut in triplet_mutations_und if (mut[2] != 'G')], min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_G4_positions["loop 5'"] = mut_norm_conf(count_internal_resum(count_internal_all['G4_K+'], ['category'], selections = ['loop_positions_edge']), tri_subset = [mut for mut in triplet_mutations_und if (mut[2] != 'G')], min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_G4_positions["loop 3'"] = mut_norm_conf(count_internal_resum(count_internal_all['G4_K+'], ['category'], selections = ['loop_positions_edge']), tri_subset = [mut for mut in triplet_mutations_und if (mut[0] != 'G')], min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_G4_positions["loop 1nt"] = mut_norm_conf(count_internal_resum(count_internal_all['G4_K+'], ['category'], selections = ['loop_positions_edge']), tri_subset = [mut for mut in triplet_mutations_und if (mut[0] == 'G') and (mut[2] == 'G')], min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_G4_positions['run_positions_middle'] =  mut_norm_conf(count_internal_resum(count_internal_all['G4_K+'], ['category'], selections = ['run_positions_middle']), min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
norm_internal_G4_positions['loop_positions_middle'] =  mut_norm_conf(count_internal_resum(count_internal_all['G4_K+'], ['category'], selections = ['loop_positions_middle']), min_count = 0, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)

# Reorganize data structure
G4_positionorder = ["run 5'", "run_positions_middle", "run 3'", "loop 5'", "loop_positions_middle",  "loop 3'", "loop 1nt"]
# One DataFrame per element (0, 1, 2) of the mut_norm_conf output, with one column per QC cutoff
G4_norm_summary = {stat_idx: pd.DataFrame() for stat_idx in range(3)}
for QC_cutoff in vqslod_list:
    for stat_idx in range(3):
        G4_norm_summary[stat_idx][QC_cutoff] = pd.Series(list(pd.concat([norm_internal_G4_positions[category][stat_idx][QC_cutoff] for category in norm_internal_G4_positions])), index = list(norm_internal_G4_positions)).reindex(G4_positionorder)
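
The edge categories above are split purely by which flanking base of the triplet is a G. The sketch below applies the same three list-comprehension filters used for the loop edges to a toy list. It assumes mutation labels of the form `'XYZ_N'` (reference trinucleotide, underscore, alternate base), as the labels elsewhere in this notebook suggest; `toy_mutations` is made up for illustration and is not `triplet_mutations_und`.

```python
# Toy labels: reference trinucleotide, underscore, alternate base (assumed format)
toy_mutations = ['GAT_C', 'TAG_C', 'GAG_C']

# Same conditions as the "loop 5'", "loop 3'" and "loop 1nt" subsets above
loop_5prime = [mut for mut in toy_mutations if mut[2] != 'G']
loop_3prime = [mut for mut in toy_mutations if mut[0] != 'G']
loop_1nt = [mut for mut in toy_mutations if mut[0] == 'G' and mut[2] == 'G']

print(loop_5prime)  # ['GAT_C']
print(loop_3prime)  # ['TAG_C']
print(loop_1nt)     # ['GAG_C']
```

Note that a triplet with no flanking G would pass both of the first two filters, so the split relies on the positions already being restricted to an edge category.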

In [None]:
# Triplet-level data for G4 positions
norm_internal_G4_positions_tri = dict()
norm_internal_G4_positions_tri["run 5'"] = mut_norm_conf(count_internal_resum(count_internal_all['G4_K+'], ['category'], selections = ['run_positions_edge']), tri_subset = [mut for mut in triplet_mutations_und if (mut[0] != 'G')], min_count = 10, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True, output_div=True)
norm_internal_G4_positions_tri["run 3'"] = mut_norm_conf(count_internal_resum(count_internal_all['G4_K+'], ['category'], selections = ['run_positions_edge']), tri_subset = [mut for mut in triplet_mutations_und if (mut[2] != 'G')], min_count = 10, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True, output_div=True)
norm_internal_G4_positions_tri["loop 5'"] = mut_norm_conf(count_internal_resum(count_internal_all['G4_K+'], ['category'], selections = ['loop_positions_edge']), tri_subset = [mut for mut in triplet_mutations_und if (mut[2] != 'G')], min_count = 10, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True, output_div=True)
norm_internal_G4_positions_tri["loop 3'"] = mut_norm_conf(count_internal_resum(count_internal_all['G4_K+'], ['category'], selections = ['loop_positions_edge']), tri_subset = [mut for mut in triplet_mutations_und if (mut[0] != 'G')], min_count = 10, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True, output_div=True)
norm_internal_G4_positions_tri["loop 1nt"] = mut_norm_conf(count_internal_resum(count_internal_all['G4_K+'], ['category'], selections = ['loop_positions_edge']), tri_subset = [mut for mut in triplet_mutations_und if (mut[0] == 'G') and (mut[2] == 'G')], min_count = 10, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True, output_div=True)
norm_internal_G4_positions_tri['run_positions_middle'] =  mut_norm_conf(count_internal_resum(count_internal_all['G4_K+'], ['category'], selections = ['run_positions_middle']), min_count = 10, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True, output_div=True)
norm_internal_G4_positions_tri['loop_positions_middle'] =  mut_norm_conf(count_internal_resum(count_internal_all['G4_K+'], ['category'], selections = ['loop_positions_middle']), min_count = 10, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True, output_div=True)

# Reorganize data structure
G4_norm_summary_tri = dict()
for QC_cutoff in vqslod_list:
    G4_norm_summary_tri[QC_cutoff] = pd.concat([norm_internal_G4_positions_tri[category][0][QC_cutoff] for category in norm_internal_G4_positions_tri], axis=1) / normtorandom_all[QC_cutoff]
    G4_norm_summary_tri[QC_cutoff].columns = list(norm_internal_G4_positions_tri)

#### Z-DNA motifs <a name="mutation_internal_count_analyze_ZDNA"></a>

[Return to Table of Contents](#TOC)

In [None]:
count_internal_all['ZDNA'] = count_internal_mut_pos(pos_expand_all['ZDNA'].copy(), useful_cols = ['category', 'length'], gc_nmer = 51, combine_LR = True)
count_internal_all['ZDNA_GY'] = count_internal_mut_pos(pos_expand_all['ZDNA_GY'].copy(), useful_cols = ['Strand', 'category', 'length'], gc_nmer = 51, combine_LR = True)

norm_internal_all['ZDNA'] = mut_norm_conf(count_internal_all['ZDNA'], min_count = 10, normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)

#### Save/load internal mutation counts

In [None]:
# Save temporary output of the mutation counts
with open('./analysis/temp/mut_internal_counts_ACcor_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(count_internal_all, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Save temporary output of the normalized mutation counts
with open('./analysis/temp/mut_internal_norm_ACcor_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(norm_internal_all, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Load temporary output of the mutation counts
with open('./analysis/temp/mut_internal_counts_ACcor_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    count_internal_all = pickle.load(handle)

# Load temporary output of the normalized mutation counts
with open('./analysis/temp/mut_internal_norm_ACcor_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    norm_internal_all = pickle.load(handle)
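
The save and load cells above repeat the same file-name pattern four times. As a minimal sketch, the pattern could be wrapped in small helpers; `temp_pickle_path`, `save_pickle` and `load_pickle` are hypothetical names that do not appear in the original notebook, and the toy dictionary stands in for `count_internal_all`.

```python
import os
import pickle
import tempfile

def temp_pickle_path(base_dir, kind, chr_range):
    # kind is 'counts' or 'norm', mirroring the two file names used above
    return os.path.join(base_dir, f'mut_internal_{kind}_ACcor_chr{chr_range}-22.pickle')

def save_pickle(obj, path):
    # Create the target directory if needed, then write with the highest protocol
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'wb') as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)

def load_pickle(path):
    with open(path, 'rb') as handle:
        return pickle.load(handle)

# Round trip in a throwaway directory, using a toy stand-in for the counts
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_counts = {'G4': {'run_positions_edge': 12}}
    path = temp_pickle_path(os.path.join(tmp_dir, 'analysis', 'temp'), 'counts', 1)
    save_pickle(toy_counts, path)
    restored = load_pickle(path)

print(os.path.basename(path))
print(restored == toy_counts)
```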

## Plot mutation frequency within motifs <a name="mutation_internal_count_plot"></a>

[Return to Table of Contents](#TOC)

In [None]:
# Assign colors to QC cutoffs
QC_colors_internal = make_colorscale(vqslod_list, 0.5)
QC_colors_internal = pd.DataFrame(QC_colors_internal).transpose()
QC_colors_internal['name'] = ['no QC', 'pass', 'VQSLOD >0', 'VQSLOD >4']
QC_colors_internal_passfail = QC_colors_internal.reindex([-np.inf, -2.774])

In [None]:
# Function to generate individual plot traces
def plot_internal_add(motif, plot_range, useful_cols, row_n, col_n, input_dict, showleg, plot_name, pred_factor, QC_colors_current = QC_colors_internal):
    for QCfilter in QC_colors_current.index:
        # Element 0 holds the plotted values; elements 1 and 2 are used as the lower and upper error-bar bounds
        plot_series = input_dict[motif][0][QCfilter]
        plot_low = input_dict[motif][1][QCfilter]
        plot_high = input_dict[motif][2][QCfilter]
        # Step down through the index levels listed in useful_cols
        for level in useful_cols:
            plot_series = plot_series.loc[level]
            plot_low = plot_low.loc[level]
            plot_high = plot_high.loc[level]
        # Reindex to every integer in plot_range (inclusive); missing positions become NaN
        x_positions = list(range(plot_range[0], plot_range[1]+1))
        plot_series = plot_series.reindex(x_positions)
        plot_low = plot_low.reindex(x_positions)
        plot_high = plot_high.reindex(x_positions)
        
        plot_name.add_trace(go.Scatter(x = plot_series.index, y = plot_series * pred_factor, name = QC_colors_current['name'][QCfilter], marker = dict(color = QC_colors_current[0][QCfilter]), mode = 'lines', legendgroup = QCfilter, showlegend=showleg,
        error_y=dict(type='data', symmetric=False, array = (plot_high -  plot_series) * pred_factor, arrayminus =  (plot_series - plot_low) * pred_factor, color=QC_colors_current[1][QCfilter], thickness=1.5, width=3),), row = row_n, col = col_n)
    plot_name.update_xaxes(range = [plot_range[0]-1, plot_range[1]+1], row = row_n, col = col_n)
    plot_name.add_shape(type='line', x0=plot_range[0]-1, y0=1, x1=plot_range[1]+1, y1=1, line=dict(color='Black', width = .5), row = row_n, col = col_n)

# Function to generate individual plots
def plot_internal(motif, useful_cols, plot_range = [10,20], input_dict = norm_internal_all, row_n = 1, col_n = 1, pred_factor = 1, QC_colors_current = QC_colors_internal, log = False, yrange = [-0.55,12]):
    mutnorm_internal_fig = make_subplots()
    plot_internal_add(motif = motif, useful_cols = useful_cols, row_n = 1, col_n = 1, plot_range = plot_range, input_dict = input_dict, showleg = True, plot_name = mutnorm_internal_fig, pred_factor = pred_factor, QC_colors_current = QC_colors_current)
    mutnorm_internal_fig.add_shape(type='line', x0=0, y0=1, x1=50, y1=1, line=dict(color='Black', width = .3), row=row_n, col=col_n)
    mutnorm_internal_fig.update_xaxes(dtick = 5, range = [plot_range[0]-1, plot_range[1]+1])
    mutnorm_internal_fig.update_yaxes(range = yrange, tickmode = 'array', tickvals = [1,5,10], zeroline = False,  type = 'log' if log else 'linear')
    return mutnorm_internal_fig
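
`plot_internal_add` passes asymmetric error bars to plotly as distances from the plotted value, not as absolute bounds, and scales both the value and the distances by `pred_factor`. The sketch below reproduces only that arithmetic on plain lists; `asymmetric_error` is a hypothetical helper and the numbers are made up.

```python
def asymmetric_error(values, low, high, factor=1):
    """Return (scaled values, upward distances, downward distances)."""
    scaled = [v * factor for v in values]
    # Distance from each value up to its upper bound (plotly's error_y 'array')
    upward = [(h - v) * factor for v, h in zip(values, high)]
    # Distance from each value down to its lower bound (plotly's error_y 'arrayminus')
    downward = [(v - l) * factor for v, l in zip(values, low)]
    return scaled, upward, downward

scaled, upward, downward = asymmetric_error([2.0, 4.0], low=[1.5, 3.0], high=[3.0, 4.5], factor=3)
print(scaled)    # [6.0, 12.0]
print(upward)    # [3.0, 1.5]
print(downward)  # [1.5, 3.0]
```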

#### STR A-mononucleotide motifs (Fig 3A) <a name="mutation_internal_count_plot_3A"></a>

[Return to Table of Contents](#TOC)

In [None]:
inframe_mut_fig3a = make_subplots(rows=1, cols=5, shared_yaxes=True, shared_xaxes = True, vertical_spacing = 0.025, horizontal_spacing = 0.02, subplot_titles = ['Flank', 'Start/end', 'Within motif', 'Perfecting', 'Non-perfecting'])
plot_internal_add('STR', [6,31], ['non_pred', 'A', 'flank_pos', 'perfect'], 1, 1, norm_internal_all, showleg = False, plot_name = inframe_mut_fig3a, pred_factor = 1)
plot_internal_add('STR', [6,31], ['non_pred', 'A', 'motif_pos', 'perfect'], 1, 2, norm_internal_all, showleg = False, plot_name = inframe_mut_fig3a, pred_factor = 1)
plot_internal_add('STR', [6,30], ['non_pred', 'A', 'motif_pos_genome_middle', 'perfect'], 1, 3, norm_internal_all, showleg = False, plot_name = inframe_mut_fig3a, pred_factor = 1)
plot_internal_add('STR', [10,30], ['pred', 'A', 'MM_pos_genome', 'inframe'], 1, 4, norm_internal_all, showleg = True, plot_name = inframe_mut_fig3a, pred_factor = 3)
plot_internal_add('STR', [10,30], ['against_pred', 'A', 'MM_pos_genome', 'inframe'], 1, 5, norm_internal_all, showleg = False, plot_name = inframe_mut_fig3a, pred_factor = 1.5)
inframe_mut_fig3a.update_yaxes(title = dict(text = 'A-mono', font = dict(size = 18), standoff = 0), row = 1, col = 1)

inframe_mut_fig3a.update_yaxes(zeroline = False, range = [-0.5,11], dtick = 2)
inframe_mut_fig3a.update_xaxes(title = dict(text = 'motif length', font = dict(size = 14), standoff = 0), row = 1, col = 1)
inframe_mut_fig3a.update_layout(width = 750, height = 200, margin = dict(l = 35, r = 25, b = 0, t = 20), legend=dict(y = -0.10, x = 0.25, orientation='h'))
inframe_mut_fig3a.show()

In [None]:
inframe_mut_fig3a.write_image('./plots/revision_ACcor_internal_mutation_fig_3a.png', format='png', scale = 10, engine = 'orca')

#### STR motifs (Fig. S3B) <a name="mutation_internal_count_plot_S3B"></a>

[Return to Table of Contents](#TOC)

In [None]:
# Unused figure: additional STRs
repeats_figlist = ['A', 'C', 'AT', 'AC', 'AG', 'ACC', 'AGG', 'ATC', 'AGC', 'AAG', 'AAC', 'AAT', 'AAAT']

inframe_mut_figS3b = make_subplots(rows=len(repeats_figlist), cols=5, shared_yaxes=True, shared_xaxes = True, vertical_spacing = 0.015, horizontal_spacing = 0.02, subplot_titles = ['Flank', 'Start/end', 'Within motif', 'Perfecting', 'Non-perfecting'])
for counter, repeat in enumerate(repeats_figlist, start=1):
    plot_internal_add('STR', [8,31], ['non_pred', repeat, 'flank_pos', 'perfect'], counter, 1, norm_internal_all, showleg = (counter == 1), plot_name = inframe_mut_figS3b, pred_factor = 1)
    plot_internal_add('STR', [8,31], ['non_pred', repeat, 'motif_pos', 'perfect'], counter, 2, norm_internal_all, showleg = False, plot_name = inframe_mut_figS3b, pred_factor = 1)
    plot_internal_add('STR', [8,30], ['non_pred', repeat, 'motif_pos_genome_middle', 'perfect'], counter, 3, norm_internal_all, showleg = False, plot_name = inframe_mut_figS3b, pred_factor = 1)
    plot_internal_add('STR', [14,30], ['pred', repeat, 'MM_pos_genome', 'inframe'], counter, 4, norm_internal_all, showleg = False, plot_name = inframe_mut_figS3b, pred_factor = 3)
    plot_internal_add('STR', [14,30], ['against_pred', repeat, 'MM_pos_genome', 'inframe'], counter, 5, norm_internal_all, showleg = False, plot_name = inframe_mut_figS3b, pred_factor = 1.5)
    inframe_mut_figS3b.update_yaxes(title = dict(text = repeat, font = dict(size = 18), standoff = 0), row = counter, col = 1)

inframe_mut_figS3b.update_yaxes(zeroline = False, range = [-0.5,15.99], dtick = 4)
inframe_mut_figS3b.update_xaxes(dtick = 5)
inframe_mut_figS3b.update_xaxes(title = dict(text = 'motif length', font = dict(size = 14), standoff = 0), row = counter, col = 1)
inframe_mut_figS3b.update_layout(width = 600, height = 100*len(repeats_figlist), margin = dict(l = 35, r = 25, b = 0, t = 20), legend=dict(y = -0.02, x = 0.25, orientation='h'))
inframe_mut_figS3b.show()

In [None]:
inframe_mut_figS3b.write_image('./plots/revision_ACcor_internal_mutation_fig_S3B.png', format='png', scale = 10, engine = 'orca')

#### IR motifs (Fig S5A) <a name="mutation_internal_count_plot_S5A"></a>

[Return to Table of Contents](#TOC)

In [None]:
name_order = ["Flank", 'Motif int.', "Spacer ends", "Flank", 'Motif int.', "Spacer ends", 'Motif', 'Spacer mid']

other_pos_stemlen_figS5a = make_subplots(cols=len(name_order), rows = 4, vertical_spacing = 0, horizontal_spacing = 0.02, subplot_titles = name_order, shared_yaxes=True)

plot_internal_add('IR_stemlen_spacer3', [10,19], ['pred', 'flank_pos'], 1, 1, norm_internal_all, showleg = True, plot_name = other_pos_stemlen_figS5a, pred_factor = 3)
plot_internal_add('IR_stemlen_spacer3', [10,19], ['pred', 'MM_pos'], 1, 2, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 3)
plot_internal_add('IR_stemlen_spacer3', [10,19], ['pred', 'spacer_pos'], 1, 3, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 3)

plot_internal_add('IR_stemlen_spacer3', [10,19], ['against_pred', 'flank_pos'], 1, 4, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1.5)
plot_internal_add('IR_stemlen_spacer3', [10,19], ['against_pred', 'MM_pos'], 1, 5, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1.5)
plot_internal_add('IR_stemlen_spacer3', [10,19], ['against_pred', 'spacer_pos'], 1, 6, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1.5)

plot_internal_add('IR_stemlen_spacer3', [10,19], ['non_pred', 'motif_pos_genome_middle'], 1, 7, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1)
plot_internal_add('IR_stemlen_spacer3', [10,19], ['non_pred', 'spacer_middle_pos'], 1, 8, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1)


plot_internal_add('IR_stemlen_spacer410', [10,19], ['pred', 'flank_pos'], 2, 1, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 3)
plot_internal_add('IR_stemlen_spacer410', [10,19], ['pred', 'MM_pos'], 2, 2, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 3)
plot_internal_add('IR_stemlen_spacer410', [10,19], ['pred', 'spacer_pos'], 2, 3, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 3)

plot_internal_add('IR_stemlen_spacer410', [10,19], ['against_pred', 'flank_pos'], 2, 4, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1.5)
plot_internal_add('IR_stemlen_spacer410', [10,19], ['against_pred', 'MM_pos'], 2, 5, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1.5)
plot_internal_add('IR_stemlen_spacer410', [10,19], ['against_pred', 'spacer_pos'], 2, 6, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1.5)

plot_internal_add('IR_stemlen_spacer410', [10,19], ['non_pred', 'motif_pos_genome_middle'], 2, 7, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1)
plot_internal_add('IR_stemlen_spacer410', [10,19], ['non_pred', 'spacer_middle_pos'], 2, 8, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1)


plot_internal_add('IR_stemlen_spacer_over10', [10,19], ['pred', 'flank_pos'], 3, 1, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 3)
plot_internal_add('IR_stemlen_spacer_over10', [10,19], ['pred', 'MM_pos'], 3, 2, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 3)
plot_internal_add('IR_stemlen_spacer_over10', [10,19], ['pred', 'spacer_pos'], 3, 3, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 3)

plot_internal_add('IR_stemlen_spacer_over10', [10,19], ['against_pred', 'flank_pos'], 3, 4, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1.5)
plot_internal_add('IR_stemlen_spacer_over10', [10,19], ['against_pred', 'MM_pos'], 3, 5, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1.5)
plot_internal_add('IR_stemlen_spacer_over10', [10,19], ['against_pred', 'spacer_pos'], 3, 6, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1.5)

plot_internal_add('IR_stemlen_spacer_over10', [10,19], ['non_pred', 'motif_pos_genome_middle'], 3, 7, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1)
plot_internal_add('IR_stemlen_spacer_over10', [10,19], ['non_pred', 'spacer_middle_pos'], 3, 8, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1)



plot_internal_add('IR_spacer', [0,11], ['pred', 'flank_pos', 0], 4, 1, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 3)
plot_internal_add('IR_spacer', [0,11], ['pred', 'MM_pos', 1], 4, 2, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 3)
plot_internal_add('IR_spacer', [0,11], ['pred', 'spacer_pos', 0], 4, 3, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 3)

plot_internal_add('IR_spacer', [0,11], ['against_pred', 'flank_pos', 0], 4, 4, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1.5)
plot_internal_add('IR_spacer', [0,11], ['against_pred', 'MM_pos', 1], 4, 5, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1.5)
plot_internal_add('IR_spacer', [0,11], ['against_pred', 'spacer_pos', 0], 4, 6, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1.5)

plot_internal_add('IR_spacer', [0,11], ['non_pred', 'motif_pos_genome_middle', 0], 4, 7, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1)
plot_internal_add('IR_spacer', [0,11], ['non_pred', 'spacer_middle_pos', 0], 4, 8, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS5a, pred_factor = 1)


other_pos_stemlen_figS5a.add_shape(type='line', x0=0.375, y0=-0.025, x1=0.375, y1=1.025, xref='paper', yref='paper', line=dict(color='rgb(150,150,150)', width = 3))
other_pos_stemlen_figS5a.add_shape(type='line', x0=0.76, y0=-0.025, x1=0.76, y1=1.025, xref='paper', yref='paper', line=dict(color='rgb(150,150,150)', width = 3))

other_pos_stemlen_figS5a.update_yaxes(range = [-0.1, 7.5], zeroline = False)

other_pos_stemlen_figS5a.update_yaxes(title = dict(text = 'Spacer 1-3', font = dict(size = 14)), title_standoff = 0, col = 1, row = 1)
other_pos_stemlen_figS5a.update_yaxes(title = dict(text = 'Spacer 4-10', font = dict(size = 14)), title_standoff = 0, col = 1, row = 2)
other_pos_stemlen_figS5a.update_yaxes(title = dict(text = 'Spacer >10', font = dict(size = 14)), title_standoff = 0, col = 1, row = 3)
other_pos_stemlen_figS5a.update_yaxes(title = dict(text = 'Motif >9 nt', font = dict(size = 14)), title_standoff = 0, col = 1, row = 4)

other_pos_stemlen_figS5a.update_xaxes(title = dict(text = 'Motif length', font = dict(size = 14)), title_standoff = 0, col = 2, row = 3)
other_pos_stemlen_figS5a.update_xaxes(title = dict(text = 'Spacer length', font = dict(size = 14)), title_standoff = 0, col = 2, row = 4)
other_pos_stemlen_figS5a.update_xaxes(zeroline = False, row = 4)

# Hide tick labels on the top two rows
for subplot_row in [1, 2]:
    other_pos_stemlen_figS5a.update_xaxes(showticklabels=False, row = subplot_row)
# Rows 1-3 share the motif-length axis; row 4 uses the spacer-length axis
for subplot_row in [1, 2, 3]:
    other_pos_stemlen_figS5a.update_xaxes(range = [9, 21], row = subplot_row)
other_pos_stemlen_figS5a.update_xaxes(range = [-1.5, 12], row = 4)

other_pos_stemlen_figS5a.update_yaxes(domain = [0.01, 0.22], row = 4)
other_pos_stemlen_figS5a.update_yaxes(domain = [0.32, 0.52], row = 3)
other_pos_stemlen_figS5a.update_yaxes(domain = [0.55, 0.75], row = 2)
other_pos_stemlen_figS5a.update_yaxes(domain = [0.78, 0.99], row = 1)

other_pos_stemlen_figS5a.update_layout(title = dict(text = 'Perfecting mutations                        Non-perfecting mutations                  Other mutations', x = 0.125, font = dict(size = 18)),
    height = 550, width = 1000, margin = dict(l = 55, r = 25, b = 25, t = 55), legend=dict(y = -0.03, x = 0.45, orientation='h'))


other_pos_stemlen_figS5a.show()
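
The 32 `plot_internal_add` calls in this cell follow a regular 4 x 8 grid, and the DR and MR figures use the same grid with a different prefix. One possible way to generate the arguments from a small table is sketched below. `panel_specs` is a hypothetical helper that is not part of the original notebook; it only builds argument tuples and does not call plotly.

```python
def panel_specs(prefix):
    """Build (motif key, plot_range, useful_cols, row, col, pred_factor) tuples."""
    # One entry per subplot column: prediction class, position category, pred_factor
    columns = [
        ('pred', 'flank_pos', 3),
        ('pred', 'MM_pos', 3),
        ('pred', 'spacer_pos', 3),
        ('against_pred', 'flank_pos', 1.5),
        ('against_pred', 'MM_pos', 1.5),
        ('against_pred', 'spacer_pos', 1.5),
        ('non_pred', 'motif_pos_genome_middle', 1),
        ('non_pred', 'spacer_middle_pos', 1),
    ]
    stem_rows = ['stemlen_spacer3', 'stemlen_spacer410', 'stemlen_spacer_over10']
    specs = []
    # Rows 1-3: motif length on the x axis, one row per spacer-length bin
    for row, suffix in enumerate(stem_rows, start=1):
        for col, (prediction, category, factor) in enumerate(columns, start=1):
            specs.append((f'{prefix}_{suffix}', [10, 19], [prediction, category], row, col, factor))
    # Row 4: spacer length on the x axis; the calls above pass 1 as the third level for MM_pos and 0 otherwise
    for col, (prediction, category, factor) in enumerate(columns, start=1):
        third_level = 1 if category == 'MM_pos' else 0
        specs.append((f'{prefix}_spacer', [0, 11], [prediction, category, third_level], 4, col, factor))
    return specs

ir_specs = panel_specs('IR')
print(len(ir_specs))  # 32
print(ir_specs[0])
```

Each tuple could then be unpacked into a `plot_internal_add` call, with `showleg` set to true only for row 1, column 1.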

In [None]:
other_pos_stemlen_figS5a.write_image('./plots/revision_ACcor_internal_mutation_IR_fig_S5a.png', format='png', scale = 10, engine = 'orca')

#### Mutation spectrum for IR spacer mutations (Fig. S5c) <a name="mutation_internal_count_plot_S5C"></a>

[Return to Table of Contents](#TOC)

In [None]:
colors_apobec = colors.copy()
colors_apobec.loc['APOBEC'] = ['purple', ['TCA_T', 'TCT_T', 'TCG_T', 'TCC_T'], [], 7, ['TCA_T', 'TCT_T', 'TCG_T', 'TCC_T'], ['TGA_A', 'AGA_A', 'CGA_A', 'GGA_A'], ['TCA_T', 'TCT_T', 'TCG_T', 'TCC_T', 'TGA_A', 'AGA_A', 'CGA_A', 'GGA_A']]

norm_internal_bymut = dict()
norm_internal_bymut['IR_stemlen_spacer3'] = dict()
for mut_type in colors_apobec.index:
    norm_internal_bymut['IR_stemlen_spacer3'][mut_type] = mut_norm_conf(count_internal_IR_spacer3 , min_count = 5, tri_subset = colors_apobec['ind_all'][mut_type], normtorandom = True, random_normaverage=normtorandom_all, gc_correct=True)
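
The `APOBEC` row above lists the four TCN>T mutations together with their reverse complements. The sketch below shows how the second list can be derived from the first. It assumes labels of the form `'XYZ_N'` (reference trinucleotide, underscore, alternate base at the central position), as the entries above suggest; `revcomp_mutation` is a hypothetical helper.

```python
COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

def revcomp_mutation(label):
    """Reverse-complement a 'XYZ_N' mutation label."""
    triplet, alt = label.split('_')
    # Reverse the trinucleotide and complement each base
    rc_triplet = ''.join(COMPLEMENT[base] for base in reversed(triplet))
    # The alternate base sits at the centre, so it is only complemented
    return rc_triplet + '_' + COMPLEMENT[alt]

apobec_forward = ['TCA_T', 'TCT_T', 'TCG_T', 'TCC_T']
apobec_reverse = [revcomp_mutation(label) for label in apobec_forward]
print(apobec_reverse)  # ['TGA_A', 'AGA_A', 'CGA_A', 'GGA_A']
```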

In [None]:
IR_bymut_figS5c = go.Figure()
for mut_type in colors_apobec.index:
    IR_bymut_figS5c.add_trace(go.Scatter(x = norm_internal_bymut['IR_stemlen_spacer3'][mut_type][0][-np.inf]['non_pred']['spacer_middle_pos'].index, y = norm_internal_bymut['IR_stemlen_spacer3'][mut_type][0][-np.inf]['non_pred']['spacer_middle_pos'], name = mut_type, mode = 'lines', opacity = 0.75, line = dict(color = colors_apobec.loc[mut_type]['color'], )))

IR_bymut_figS5c.add_shape(type='line', x0=9, y0=1, x1=16, y1=1, line=dict(color='Black', width = .5))
IR_bymut_figS5c.update_yaxes(zeroline = False, tickmode = 'array', tickvals = [1,3,5,7], title = dict(text = 'Spacer 1-3 mid', standoff = 0, font = dict(size = 14)))
IR_bymut_figS5c.update_xaxes(range = [9.5, 15.5], dtick = 1, title = dict(text = 'Motif length', standoff = 0, font = dict(size = 14)))
IR_bymut_figS5c.update_layout(height = 235, width = 520, margin = dict(l = 35, r = 25, b = 35, t = 25))
IR_bymut_figS5c.show()

In [None]:
IR_bymut_figS5c.write_image('./plots/revision_ACcor_internal_IR_bymut_apobec_figS5c.png', format='png', scale = 10, engine = 'orca')

#### MR/DR motifs (Fig S4a, S6a) <a name="mutation_internal_count_plot_S6A"></a>

[Return to Table of Contents](#TOC)

In [None]:
name_order = ["Flank", 'Motif int.', "Spacer ends", "Flank", 'Motif int.', "Spacer ends", 'Motif', 'Spacer mid']

other_pos_stemlen_figS4a = make_subplots(cols=len(name_order), rows = 4, horizontal_spacing = 0.025, subplot_titles = name_order, shared_yaxes=True)

plot_internal_add('DR_stemlen_spacer3', [10,19], ['pred', 'flank_pos'], 1, 1, norm_internal_all, showleg = True, plot_name = other_pos_stemlen_figS4a, pred_factor = 3)
plot_internal_add('DR_stemlen_spacer3', [10,19], ['pred', 'MM_pos'], 1, 2, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 3)
plot_internal_add('DR_stemlen_spacer3', [10,19], ['pred', 'spacer_pos'], 1, 3, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 3)

plot_internal_add('DR_stemlen_spacer3', [10,19], ['against_pred', 'flank_pos'], 1, 4, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1.5)
plot_internal_add('DR_stemlen_spacer3', [10,19], ['against_pred', 'MM_pos'], 1, 5, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1.5)
plot_internal_add('DR_stemlen_spacer3', [10,19], ['against_pred', 'spacer_pos'], 1, 6, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1.5)

plot_internal_add('DR_stemlen_spacer3', [10,19], ['non_pred', 'motif_pos_genome_middle'], 1, 7, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1)
plot_internal_add('DR_stemlen_spacer3', [10,19], ['non_pred', 'spacer_middle_pos'], 1, 8, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1)

plot_internal_add('DR_stemlen_spacer410', [10,19], ['pred', 'flank_pos'], 2, 1, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 3)
plot_internal_add('DR_stemlen_spacer410', [10,19], ['pred', 'MM_pos'], 2, 2, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 3)
plot_internal_add('DR_stemlen_spacer410', [10,19], ['pred', 'spacer_pos'], 2, 3, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 3)

plot_internal_add('DR_stemlen_spacer410', [10,19], ['against_pred', 'flank_pos'], 2, 4, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1.5)
plot_internal_add('DR_stemlen_spacer410', [10,19], ['against_pred', 'MM_pos'], 2, 5, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1.5)
plot_internal_add('DR_stemlen_spacer410', [10,19], ['against_pred', 'spacer_pos'], 2, 6, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1.5)

plot_internal_add('DR_stemlen_spacer410', [10,19], ['non_pred', 'motif_pos_genome_middle'], 2, 7, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1)
plot_internal_add('DR_stemlen_spacer410', [10,19], ['non_pred', 'spacer_middle_pos'], 2, 8, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1)

plot_internal_add('DR_stemlen_spacer_over10', [10,19], ['pred', 'flank_pos'], 3, 1, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 3)
plot_internal_add('DR_stemlen_spacer_over10', [10,19], ['pred', 'MM_pos'], 3, 2, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 3)
plot_internal_add('DR_stemlen_spacer_over10', [10,19], ['pred', 'spacer_pos'], 3, 3, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 3)

plot_internal_add('DR_stemlen_spacer_over10', [10,19], ['against_pred', 'flank_pos'], 3, 4, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1.5)
plot_internal_add('DR_stemlen_spacer_over10', [10,19], ['against_pred', 'MM_pos'], 3, 5, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1.5)
plot_internal_add('DR_stemlen_spacer_over10', [10,19], ['against_pred', 'spacer_pos'], 3, 6, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1.5)

plot_internal_add('DR_stemlen_spacer_over10', [10,19], ['non_pred', 'motif_pos_genome_middle'], 3, 7, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1)
plot_internal_add('DR_stemlen_spacer_over10', [10,19], ['non_pred', 'spacer_middle_pos'], 3, 8, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1)

plot_internal_add('DR_spacer', [0,11], ['pred', 'flank_pos', 0], 4, 1, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 3)
plot_internal_add('DR_spacer', [0,11], ['pred', 'MM_pos', 1], 4, 2, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 3)
plot_internal_add('DR_spacer', [0,11], ['pred', 'spacer_pos', 0], 4, 3, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 3)

plot_internal_add('DR_spacer', [0,11], ['against_pred', 'flank_pos', 0], 4, 4, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1.5)
plot_internal_add('DR_spacer', [0,11], ['against_pred', 'MM_pos', 1], 4, 5, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1.5)
plot_internal_add('DR_spacer', [0,11], ['against_pred', 'spacer_pos', 0], 4, 6, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1.5)

plot_internal_add('DR_spacer', [0,11], ['non_pred', 'motif_pos_genome_middle', 0], 4, 7, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1)
plot_internal_add('DR_spacer', [0,11], ['non_pred', 'spacer_middle_pos', 0], 4, 8, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS4a, pred_factor = 1)


other_pos_stemlen_figS4a.add_shape(type='line', x0=0.375, y0=-0.025, x1=0.375, y1=1.025, xref='paper', yref='paper', line=dict(color='rgb(150,150,150)', width = 3))
other_pos_stemlen_figS4a.add_shape(type='line', x0=0.76, y0=-0.025, x1=0.76, y1=1.025, xref='paper', yref='paper', line=dict(color='rgb(150,150,150)', width = 3))

other_pos_stemlen_figS4a.update_yaxes(range = [-0.1, 4.25], zeroline = False)

other_pos_stemlen_figS4a.update_yaxes(title = dict(text = 'Spacer 1-3', font = dict(size = 14)), title_standoff = 0, col = 1, row = 1)
other_pos_stemlen_figS4a.update_yaxes(title = dict(text = 'Spacer 4-10', font = dict(size = 14)), title_standoff = 0, col = 1, row = 2)
other_pos_stemlen_figS4a.update_yaxes(title = dict(text = 'Spacer >10', font = dict(size = 14)), title_standoff = 0, col = 1, row = 3)
other_pos_stemlen_figS4a.update_yaxes(title = dict(text = 'Motif >9 nt', font = dict(size = 14)), title_standoff = 0, col = 1, row = 4)

other_pos_stemlen_figS4a.update_xaxes(title = dict(text = 'Motif length', font = dict(size = 14)), title_standoff = 0, col = 2, row = 3)
other_pos_stemlen_figS4a.update_xaxes(title = dict(text = 'Spacer length', font = dict(size = 14)), title_standoff = 0, col = 2, row = 4)
other_pos_stemlen_figS4a.update_xaxes(zeroline = False, row = 4)

# Hide tick labels on the top two rows
for subplot_row in [1, 2]:
    other_pos_stemlen_figS4a.update_xaxes(showticklabels=False, row = subplot_row)
# Rows 1-3 share the motif-length axis; row 4 uses the spacer-length axis
for subplot_row in [1, 2, 3]:
    other_pos_stemlen_figS4a.update_xaxes(range = [9, 21], row = subplot_row)
other_pos_stemlen_figS4a.update_xaxes(range = [-1.5, 12], row = 4)

other_pos_stemlen_figS4a.update_yaxes(domain = [0.01, 0.22], row = 4)
other_pos_stemlen_figS4a.update_yaxes(domain = [0.32, 0.52], row = 3)
other_pos_stemlen_figS4a.update_yaxes(domain = [0.55, 0.75], row = 2)
other_pos_stemlen_figS4a.update_yaxes(domain = [0.78, 0.99], row = 1)

other_pos_stemlen_figS4a.update_layout(title = dict(text = 'Perfecting mutations                        Non-perfecting mutations                  Other mutations', x = 0.125, font = dict(size = 18)),
    height = 550, width = 1000, margin = dict(l = 55, r = 25, b = 25, t = 55), legend=dict(y = -0.03, x = 0.45, orientation='h'))

other_pos_stemlen_figS4a.show()

In [None]:
other_pos_stemlen_figS4a.write_image('./plots/revision_ACcor_DR_SNVfreq_figS4a.png', format='png', scale = 10, engine = 'orca')

In [None]:
name_order = ["Flank", 'Motif int.', "Spacer ends", "Flank", 'Motif int.', "Spacer ends", 'Motif', 'Spacer mid']

other_pos_stemlen_figS6a = make_subplots(cols=len(name_order), rows = 4, vertical_spacing = 0, horizontal_spacing = 0.015, subplot_titles = name_order, shared_yaxes=True)

plot_internal_add('MR_stemlen_spacer3', [10,19], ['pred', 'flank_pos'], 1, 1, norm_internal_all, showleg = True, plot_name = other_pos_stemlen_figS6a, pred_factor = 3)
plot_internal_add('MR_stemlen_spacer3', [10,19], ['pred', 'MM_pos'], 1, 2, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 3)
plot_internal_add('MR_stemlen_spacer3', [10,19], ['pred', 'spacer_pos'], 1, 3, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 3)

plot_internal_add('MR_stemlen_spacer3', [10,19], ['against_pred', 'flank_pos'], 1, 4, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1.5)
plot_internal_add('MR_stemlen_spacer3', [10,19], ['against_pred', 'MM_pos'], 1, 5, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1.5)
plot_internal_add('MR_stemlen_spacer3', [10,19], ['against_pred', 'spacer_pos'], 1, 6, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1.5)

plot_internal_add('MR_stemlen_spacer3', [10,19], ['non_pred', 'motif_pos_genome_middle'], 1, 7, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1)
plot_internal_add('MR_stemlen_spacer3', [10,19], ['non_pred', 'spacer_middle_pos'], 1, 8, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1)


plot_internal_add('MR_stemlen_spacer410', [10,19], ['pred', 'flank_pos'], 2, 1, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 3)
plot_internal_add('MR_stemlen_spacer410', [10,19], ['pred', 'MM_pos'], 2, 2, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 3)
plot_internal_add('MR_stemlen_spacer410', [10,19], ['pred', 'spacer_pos'], 2, 3, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 3)

plot_internal_add('MR_stemlen_spacer410', [10,19], ['against_pred', 'flank_pos'], 2, 4, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1.5)
plot_internal_add('MR_stemlen_spacer410', [10,19], ['against_pred', 'MM_pos'], 2, 5, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1.5)
plot_internal_add('MR_stemlen_spacer410', [10,19], ['against_pred', 'spacer_pos'], 2, 6, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1.5)

plot_internal_add('MR_stemlen_spacer410', [10,19], ['non_pred', 'motif_pos_genome_middle'], 2, 7, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1)
plot_internal_add('MR_stemlen_spacer410', [10,19], ['non_pred', 'spacer_middle_pos'], 2, 8, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1)


plot_internal_add('MR_stemlen_spacer_over10', [10,19], ['pred', 'flank_pos'], 3, 1, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 3)
plot_internal_add('MR_stemlen_spacer_over10', [10,19], ['pred', 'MM_pos'], 3, 2, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 3)
plot_internal_add('MR_stemlen_spacer_over10', [10,19], ['pred', 'spacer_pos'], 3, 3, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 3)

plot_internal_add('MR_stemlen_spacer_over10', [10,19], ['against_pred', 'flank_pos'], 3, 4, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1.5)
plot_internal_add('MR_stemlen_spacer_over10', [10,19], ['against_pred', 'MM_pos'], 3, 5, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1.5)
plot_internal_add('MR_stemlen_spacer_over10', [10,19], ['against_pred', 'spacer_pos'], 3, 6, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1.5)

plot_internal_add('MR_stemlen_spacer_over10', [10,19], ['non_pred', 'motif_pos_genome_middle'], 3, 7, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1)
plot_internal_add('MR_stemlen_spacer_over10', [10,19], ['non_pred', 'spacer_middle_pos'], 3, 8, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1)



plot_internal_add('MR_spacer', [0,11], ['pred', 'flank_pos', 0], 4, 1, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 3)
plot_internal_add('MR_spacer', [0,11], ['pred', 'MM_pos', 1], 4, 2, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 3)
plot_internal_add('MR_spacer', [0,11], ['pred', 'spacer_pos', 0], 4, 3, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 3)

plot_internal_add('MR_spacer', [0,11], ['against_pred', 'flank_pos', 0], 4, 4, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1.5)
plot_internal_add('MR_spacer', [0,11], ['against_pred', 'MM_pos', 1], 4, 5, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1.5)
plot_internal_add('MR_spacer', [0,11], ['against_pred', 'spacer_pos', 0], 4, 6, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1.5)

plot_internal_add('MR_spacer', [0,11], ['non_pred', 'motif_pos_genome_middle', 0], 4, 7, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1)
plot_internal_add('MR_spacer', [0,11], ['non_pred', 'spacer_middle_pos', 0], 4, 8, norm_internal_all, showleg = False, plot_name = other_pos_stemlen_figS6a, pred_factor = 1)


other_pos_stemlen_figS6a.add_shape(type='line', x0=0.373, y0=-0.025, x1=0.373, y1=1.025, xref='paper', yref='paper', line=dict(color='rgb(150,150,150)', width = 3))
other_pos_stemlen_figS6a.add_shape(type='line', x0=0.755, y0=-0.025, x1=0.755, y1=1.025, xref='paper', yref='paper', line=dict(color='rgb(150,150,150)', width = 3))

other_pos_stemlen_figS6a.update_yaxes(range = [0, 7.5],zeroline = False)

other_pos_stemlen_figS6a.update_yaxes(title = dict(text = 'Spacer 1-3', font = dict(size = 14)), title_standoff = 0, col = 1, row = 1)
other_pos_stemlen_figS6a.update_yaxes(title = dict(text = 'Spacer 4-10', font = dict(size = 14)), title_standoff = 0, col = 1, row = 2)
other_pos_stemlen_figS6a.update_yaxes(title = dict(text = 'Spacer >10', font = dict(size = 14)), title_standoff = 0, col = 1, row = 3)
other_pos_stemlen_figS6a.update_yaxes(title = dict(text = 'Stem >9 nt', font = dict(size = 14)), title_standoff = 0, col = 1, row = 4)

other_pos_stemlen_figS6a.update_xaxes(title = dict(text = 'Stem length', font = dict(size = 14)), title_standoff = 0, col = 2, row = 3)
other_pos_stemlen_figS6a.update_xaxes(title = dict(text = 'Spacer length', font = dict(size = 14)), title_standoff = 0, col = 2, row = 4)
other_pos_stemlen_figS6a.update_xaxes(zeroline = False, row = 4)

other_pos_stemlen_figS6a.update_xaxes(showticklabels=False, row = 1); other_pos_stemlen_figS6a.update_xaxes(showticklabels=False, row = 2)
other_pos_stemlen_figS6a.update_xaxes(range = [9, 21], row = 1); other_pos_stemlen_figS6a.update_xaxes(range = [9, 21], row = 2); other_pos_stemlen_figS6a.update_xaxes(range = [9, 21], row = 3); other_pos_stemlen_figS6a.update_xaxes(range = [-1.5, 12], row = 4)

other_pos_stemlen_figS6a.update_yaxes(domain = [0.01, 0.22], row = 4)
other_pos_stemlen_figS6a.update_yaxes(domain = [0.32, 0.52], row = 3)
other_pos_stemlen_figS6a.update_yaxes(domain = [0.55, 0.75], row = 2)
other_pos_stemlen_figS6a.update_yaxes(domain = [0.78, 0.99], row = 1)

other_pos_stemlen_figS6a.update_layout(title = dict(text = 'Perfecting mutations                        Non-perfecting mutations                  Other mutations', x = 0.125, font = dict(size = 18)),
    height = 500, width = 1016, margin = dict(l = 55, r = 25, b = 25, t = 55), legend=dict(y = -0.03, x = 0.45, orientation='h'))

other_pos_stemlen_figS6a.show()

In [None]:
other_pos_stemlen_figS6a.write_image('./plots/revision_ACcor_MR_SNVfreq_figS6a.png', format='png', scale = 10, engine = 'orca')

#### G4 motifs (Fig. 5A) <a name="mutation_internal_count_plot_5A"></a>

[Return to Table of Contents](#TOC)
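
The cells in this section compare normalized mutation frequencies with and without the VQSLOD filter, split the per-mutation differences into increases and decreases, and give each mutation a jittered x position inside the band of its bar. The sketch below illustrates that pattern on made-up numbers; `unfiltered`, `filtered`, `bar_index`, and the fixed seed are assumptions for illustration and are not part of the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical normalized frequencies for one position category (illustrative values only)
unfiltered = pd.Series({'GTC_G': 4.0, 'GTA_G': 3.5, 'ACA_G': 1.0, 'TTT_C': 0.8})
filtered = pd.Series({'GTC_G': 1.5, 'GTA_G': 1.2, 'ACA_G': 1.1, 'TTT_C': 0.8})

effect = filtered - unfiltered      # change caused by applying the quality filter
up = effect[effect > 0]             # mutations whose frequency rises after filtering
down = effect[effect < 0]           # mutations whose frequency falls after filtering

# Jitter x positions within a band 0.75 wide that starts 0.35 left of the bar centre
rng = np.random.default_rng(0)      # fixed seed (an assumption) so the layout is reproducible
bar_index = 3
left_edge = bar_index - 0.35
jitter_down = pd.Series(left_edge + 0.75 * rng.random(len(down)), index=down.index)

print(sorted(up.index), sorted(down.index))
```

Mutations with no change fall into neither group, which matches the strict `>0` and `<0` comparisons used in the cell below.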

In [None]:
G4_QCeffect = G4_norm_summary_tri[-2.774] - G4_norm_summary_tri[-np.inf]

G4_newlistofpositions = ["run 5'", "run_positions_middle", "run 3'", "loop 5'", "loop_positions_middle",  "loop 3'", "loop 1nt"]
G4_newlistofnames = ["5'", 'mid',"3'", "5'", 'mid', "3'", '1nt']

G4_QCeffect_up = dict()
G4_QCeffect_down = dict()

count = -0.35     # left edge of the x-axis jitter band for the first bar (x = 0); incremented by 1 per position
for position in G4_newlistofpositions:
    G4_QCeffect_up[position] = pd.DataFrame(G4_QCeffect[position].loc[G4_QCeffect[position] >0].copy())
    G4_QCeffect_down[position] = pd.DataFrame(G4_QCeffect[position].loc[G4_QCeffect[position] <0].copy())

    G4_QCeffect_up[position][-np.inf] = G4_norm_summary_tri[-np.inf][position].loc[G4_QCeffect_up[position].index].copy()
    G4_QCeffect_down[position][-np.inf] = G4_norm_summary_tri[-np.inf][position].loc[G4_QCeffect_down[position].index].copy()
    G4_QCeffect_up[position][-2.774] = G4_norm_summary_tri[-2.774][position].loc[G4_QCeffect_up[position].index].copy()
    G4_QCeffect_down[position][-2.774] = G4_norm_summary_tri[-2.774][position].loc[G4_QCeffect_down[position].index].copy()

    G4_QCeffect_up[position]['random'] = count + (0.75* np.random.random(len(G4_QCeffect_up[position])))
    G4_QCeffect_down[position]['random'] = count + (0.75* np.random.random(len(G4_QCeffect_down[position])))

    G4_QCeffect_up[position]['name'] = [str(name).replace('_', '>') if G4_QCeffect_up[position][position][name] >1.85 else '' for name in G4_QCeffect_up[position].index]
    G4_QCeffect_down[position]['name'] = [str(name).replace('_', '>') if G4_QCeffect_down[position][position][name] <-1.85 else '' for name in G4_QCeffect_down[position].index]

    count +=1

# Set the x-axis location (the 'random' jitter column) manually for particular labelled mutations
G4_QCeffect_down["loop 5'"].loc['GTC_G', 'random'] = 2.65
G4_QCeffect_down["loop 5'"].loc['GTA_G', 'random'] = 3
G4_QCeffect_down["loop 5'"].loc['GTT_G', 'random'] = 3.35

G4_QCeffect_down["loop 3'"].loc['TTG_G', 'random'] = 4.65
G4_QCeffect_down["loop 3'"].loc['TAG_T', 'random'] = 5
G4_QCeffect_down["loop 3'"].loc['CAG_T', 'random'] = 5.35

G4_QCeffect_down['loop 1nt'].loc['GTG_G', 'random'] = 5.8
G4_QCeffect_down['loop 1nt'].loc['GAG_T', 'random'] = 6.15

In [None]:
G4_arrow_fig = go.Figure()

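# One bar trace per QC filter, drawn from a baseline of 1; error bars span G4_norm_summary[1] (lower) to G4_norm_summary[2] (upper)
for QC_filter in vqslod_list: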
    G4_arrow_fig.add_trace(go.Bar(x = list(range(7)), y = G4_norm_summary[0][QC_filter] -1, base = 1, marker = dict(color = QC_colors_internal[0][QC_filter]), showlegend = True, name = QC_colors_internal['name'][QC_filter], error_y=dict(type='data', symmetric=False, array = pd.Series(G4_norm_summary[2][QC_filter] - G4_norm_summary[0][QC_filter]), arrayminus = pd.Series(G4_norm_summary[0][QC_filter] - G4_norm_summary[1][QC_filter]), color=QC_colors_internal[1][QC_filter], thickness=1.5, width=3)))

list_of_annotations = []
for pos in G4_QCeffect_down:
    current_group = G4_QCeffect_down[pos]
    if len(current_group) > 0:
        current_lines = [dict(x=current_group['random'][mut], ax=current_group['random'][mut], y=current_group[-2.774][mut], ay=current_group[-np.inf][mut], xref='x1', yref='y1', axref='x1', ayref='y1', text = current_group['name'][mut], showarrow=True, arrowhead=2, arrowsize=1, arrowwidth=1.5, arrowcolor='rgba(255,0,0,0.5)', font = dict(color = 'rgb(0,0,0)')) for mut in current_group.index]
        list_of_annotations = list_of_annotations + current_lines
    current_group = G4_QCeffect_up[pos]
    if len(current_group) > 0:
        current_lines = [dict(x=current_group['random'][mut], ax=current_group['random'][mut], y=current_group[-2.774][mut], ay=current_group[-np.inf][mut], xref='x1', yref='y1', axref='x1', ayref='y1', text = current_group['name'][mut], showarrow=True, arrowhead=2, arrowsize=1, arrowwidth=1.5, arrowcolor='rgba(0,0,0,0.5)', font = dict(color = 'rgb(0,0,0)')) for mut in current_group.index]
        list_of_annotations = list_of_annotations + current_lines

list_of_annotations = list_of_annotations + [dict(text = 'G4 stem                                     spacer', font = dict(size = 18), x = 0.16, y = 0, showarrow = False, xref = 'paper', yref = 'paper')]
G4_arrow_fig.add_shape(type='line', x0=0.4275, x1=0.4275, y0=-0.1, y1=1, xref='paper', yref='paper', line=dict(color='rgb(150,150,150)', width = 3))

G4_arrow_fig.update_xaxes(tickmode = 'array', tickvals = list(range(7)), ticktext = G4_newlistofnames)
G4_arrow_fig.update_yaxes(range = [0.05, 8.05], dtick = 1, title = dict(text = 'Relative mutation frequency', standoff = 0, font = dict(size = 18)))
G4_arrow_fig.update_layout(annotations= list_of_annotations)

G4_arrow_fig.update_layout(height = 520, width = 700, margin = dict(l = 40, r = 10, b = 10, t = 10), legend=dict(y = -0.04, x = 0.2, orientation='h'))
G4_arrow_fig.update_layout(barmode = 'overlay')

In [None]:
G4_arrow_fig.write_image('./plots/revision_ACcor_internal_mutation_fig_5a.png', format='png', scale = 10, engine = 'orca')
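
The arrows in the figure above are Plotly annotations whose head (`x`, `y`) and tail (`ax`, `ay`) are both given in data coordinates, so each arrow runs from the unfiltered value to the filtered value at a fixed x position. The sketch below builds one such annotation dictionary from invented values; `make_arrow` and the example numbers are hypothetical and only mirror the keys used in the cell above.

```python
def make_arrow(x_pos, value_before, value_after, label='', color='rgba(0,0,0,0.5)'):
    """Return a Plotly-style annotation dict for a vertical arrow.

    The tail sits at (x_pos, value_before) and the head at (x_pos, value_after).
    Setting axref/ayref to axis ids makes ax/ay data coordinates rather than
    pixel offsets.
    """
    return dict(
        x=x_pos, y=value_after,          # arrow head
        ax=x_pos, ay=value_before,       # arrow tail
        xref='x1', yref='y1', axref='x1', ayref='y1',
        text=label, showarrow=True, arrowhead=2, arrowcolor=color,
    )

# Hypothetical example: frequency drops from 4.0 to 1.5 after filtering (red arrow)
arrows = [make_arrow(2.65, 4.0, 1.5, label='GTC>G', color='rgba(255,0,0,0.5)')]
# The list could then be passed to fig.update_layout(annotations=arrows)
```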

## Indels at internal positions <a name="mutation_internal_indels"></a>

[Return to Table of Contents](#TOC)

#### Functions to count indels at internal positions  <a name="mutation_internal_indels_functions"></a>
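
The per-chromosome counting function below shifts variant coordinates from 1-based to 0-based and then looks up variant counts at each motif position, treating positions without a variant as zero. The sketch below shows that lookup on toy data; `variant_counts` and `motif_positions` are invented names and values, not objects from the analysis.

```python
import pandas as pd

# Hypothetical variant counts indexed by 1-based genomic position
variant_counts = pd.Series({101: 2, 105: 1, 110: 4})

# Convert to 0-based coordinates so they match 0-based motif positions
variant_counts.index = variant_counts.index - 1

# Hypothetical 0-based positions of interest inside a motif
motif_positions = [100, 101, 104, 109]

# reindex returns NaN where no variant exists; fill with 0 and restore integer dtype
counts_at_positions = variant_counts.reindex(motif_positions).fillna(0).astype(int)
print(counts_at_positions.tolist())
```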

In [None]:
def count_internal_indel_pos_chrom(chrom, pos_current, input_mut_dict, useful_cols, qc_cutoff_list, pos_col):
    # for each qc_cutoff in the mutation dataset, find mutations overlapping the search coordinates

    current_mut_chrom = input_mut_dict[chrom].copy()
    current_mut_chrom.index = current_mut_chrom.index - 1     # convert coordinates from 1-based to 0-based

    pos_current = pd.DataFrame(pos_current.groupby([pos_col] + useful_cols).count()['Type']).copy()
    pos_current[['del', 'ins']] = 1
    
    current_mut_sum = dict()
    for qc_cutoff in qc_cutoff_list:
        current_mut_qc = pos_current[['del', 'ins']].mul(current_mut_chrom[qc_cutoff].reindex(pos_current.index.get_level_values(pos_col)).fillna(0).astype(int)).copy()
        current_mut_qc['Tri'] = [tri_function(chrom, pos, base = 0) for pos in current_mut_qc.index.get_level_values(pos_col)]
        current_mut_qc['tri_count'] = 1   
        current_mut_sum[qc_cutoff] = current_mut_qc.loc[current_mut_qc['Tri'].isin(all_triplets)].groupby(useful_cols + ['Tri']).sum().copy()
    return current_mut_sum

In [None]:
def count_internal_indel_pos(input_pos_df, input_mut_dict = variants_indel_slim_AC, qc_cutoff_list = vqslod_list_indel, pos_col = 'pos', chrom_col = 'chrom', strand_col = 'Strand', strand_names = ('+', '-'), useful_cols = ['category'], combine_LR = False):
   
    current_mut_sum_chrom = dict()
    for chrom in range(chr_range,23):
        current_mut_sum_chrom[chrom] = count_internal_indel_pos_chrom(chrom, input_pos_df.loc[input_pos_df[chrom_col] == chrom].copy(), input_mut_dict, pos_col = pos_col, useful_cols = useful_cols, qc_cutoff_list = qc_cutoff_list)
        print('finished chr' + str(chrom) + '    ', end="\r", flush=True)

    current_mut_sum = dict()
    for qc_cutoff in qc_cutoff_list:
        current_mut_sum[qc_cutoff] = pd.concat([current_mut_sum_chrom[chrom][qc_cutoff] for chrom in range(chr_range,23)])
    
    # apply reverse complement to - strand triplets, mutation counts and positions
    if strand_col in useful_cols:
        useful_cols = [col for col in useful_cols if col != strand_col]     # drop the strand column without mutating the caller's list
        internal_position_categories = pd.Series(['MM_pos', 'MM_pos_genome', "3'_motif_pos", "5'_motif_pos", "3'_flank_pos", "5'_flank_pos", 'motif_pos_genome_middle', 'MM_pos_R', 'MM_pos_L', 'spacer_pos', "spacer_3'_pos", "spacer_middle_pos", "spacer_5'_pos", 'run_positions', 'loop_positions', 'run_positions_middle', 'loop_positions_middle', 'run_positions_edge', 'loop_positions_edge'], index = ['MM_pos', 'MM_pos_genome', "5'_motif_pos", "3'_motif_pos", "5'_flank_pos", "3'_flank_pos", 'motif_pos_genome_middle', 'MM_pos_L', 'MM_pos_R', 'spacer_pos', "spacer_5'_pos", "spacer_middle_pos", "spacer_3'_pos", 'run_positions', 'loop_positions', 'run_positions_middle', 'loop_positions_middle', 'run_positions_edge', 'loop_positions_edge'])

        current_mut_sum_strand_F = dict()
        current_mut_sum_strand_R = dict()
        current_mut_sum_bothstrands = dict()
        for qc_cutoff in current_mut_sum:
            current_mut_sum_strand_F[qc_cutoff] = current_mut_sum[qc_cutoff].reset_index().loc[current_mut_sum[qc_cutoff].reset_index()[strand_col] == strand_names[0]].copy()
            current_mut_sum_strand_R[qc_cutoff] = current_mut_sum[qc_cutoff].reset_index().loc[current_mut_sum[qc_cutoff].reset_index()[strand_col] == strand_names[1]].copy()
            current_mut_sum_strand_R[qc_cutoff]['category'] = internal_position_categories.reindex(current_mut_sum_strand_R[qc_cutoff]['category']).values
            if 'repeat' in useful_cols:
                current_mut_sum_strand_R[qc_cutoff]['repeat'] = current_mut_sum_strand_R[qc_cutoff]['repeat'].apply(reverse_complement)
            current_mut_sum_strand_R[qc_cutoff]['Tri'] = current_mut_sum_strand_R[qc_cutoff]['Tri'].apply(reverse_complement)
            current_mut_sum_bothstrands[qc_cutoff] = pd.concat([current_mut_sum_strand_F[qc_cutoff], current_mut_sum_strand_R[qc_cutoff]]).groupby(useful_cols + ['Tri']).sum()#.fillna(0)
    else:
        current_mut_sum_bothstrands = dict()
        for qc_cutoff in current_mut_sum:
            current_mut_sum_bothstrands[qc_cutoff] = current_mut_sum[qc_cutoff].reset_index().groupby(useful_cols + ['Tri']).sum()#.fillna(0)
    
    if combine_LR:
        for qc_cutoff in current_mut_sum:
            current_mut_sum_bothstrands[qc_cutoff] = current_mut_sum_bothstrands[qc_cutoff].reset_index()
            current_mut_sum_bothstrands[qc_cutoff]['category'] = current_mut_sum_bothstrands[qc_cutoff]['category'].str.replace("3'_", '').str.replace("5'_", '').str.replace('_L', '').str.replace('_R', '')
            current_mut_sum_bothstrands[qc_cutoff] = current_mut_sum_bothstrands[qc_cutoff].groupby(useful_cols + ['Tri']).sum()

    # reformat output to NNN_N rows x pos columns, and split mut counts and trinucleotide counts    

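    # 'tri_count' is 1 per position regardless of QC cutoff, so any cutoff gives the same trinucleotide totals
    last_cutoff = list(current_mut_sum_bothstrands)[-1]
    current_tri_sum = current_mut_sum_bothstrands[last_cutoff].reset_index().groupby(useful_cols + ['Tri']).sum()['tri_count'].unstack().transpose().fillna(0).astype(int)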

    current_mut_sum_reformat = dict()
    for qc_cutoff in current_mut_sum:
        current_mut_sum_reformat[qc_cutoff] = current_mut_sum_bothstrands[qc_cutoff].reset_index().groupby(useful_cols + ['Tri']).sum()[['del', 'ins']].unstack().transpose().fillna(0).astype(int)
        current_mut_sum_reformat[qc_cutoff].index = current_mut_sum_reformat[qc_cutoff].index.get_level_values('Tri') + '_' + current_mut_sum_reformat[qc_cutoff].index.get_level_values(0)
        current_mut_sum_reformat[qc_cutoff] = current_mut_sum_reformat[qc_cutoff].reindex(triplet_mutations_und_indel)
        current_mut_sum_reformat[qc_cutoff].index.name = 'Mut' 
    
    return current_mut_sum_reformat.copy(), current_tri_sum.copy()
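
The strand handling above can be summarized as follows: for motifs on the minus strand, reverse-complement the trinucleotide and swap the 5'/3' (and L/R) position labels, then pool the counts with those from the plus strand. Below is a minimal pure-Python sketch of that idea. Here `reverse_complement` is a stand-in for the notebook's own helper, and the records and the `SWAP` table are invented for illustration.

```python
from collections import Counter

COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement(seq):
    """Stand-in for the notebook's helper: reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

# Position labels that change meaning when read from the opposite strand
SWAP = {"5'_flank_pos": "3'_flank_pos", "3'_flank_pos": "5'_flank_pos"}

# Hypothetical records: (strand, category, trinucleotide, deletion count)
records = [
    ('+', "5'_flank_pos", 'ACG', 2),
    ('-', "3'_flank_pos", 'CGT', 3),   # same site type as the first record once flipped
    ('-', 'spacer_pos', 'AAA', 1),
]

pooled = Counter()
for strand, category, tri, n_del in records:
    if strand == '-':
        category = SWAP.get(category, category)   # symmetric labels stay unchanged
        tri = reverse_complement(tri)
    pooled[(category, tri)] += n_del

print(dict(pooled))
```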

#### Counting and analysis of indels - STRs, DRs, MRs, IRs, ZDNA, G4s <a name="mutation_internal_indels_analysis"></a>

[Return to Table of Contents](#TOC)
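
Several normalization cells in this section restrict spacer (loop) length by passing a `list(range(...))` as one of the `selections`. Because `range` excludes its upper bound, it can help to print the lengths each bin actually covers. The sketch below does this for the bounds used in this section; the `bins` dictionary is only an illustration, keyed by the suffixes of the result names.

```python
# Spacer-length selections as written in the cells of this section
bins = {
    'loop04': range(5),
    'loop10': range(11),
    'loop1020': range(11, 20),
    'loop20100': range(21, 100),
    'loop1050': range(11, 50),
    'loop50100': range(51, 100),
    'loopmorethan10': range(11, 100),
}

for name, lengths in bins.items():
    # range(a, b) covers a .. b-1 inclusive
    print(f'{name}: spacer lengths {lengths[0]}-{lengths[-1]}')

# Lengths above 10 that fall in neither of the two paired bins
gap_1020 = sorted(set(bins['loopmorethan10']) - set(bins['loop1020']) - set(bins['loop20100']))
gap_1050 = sorted(set(bins['loopmorethan10']) - set(bins['loop1050']) - set(bins['loop50100']))
print(gap_1020, gap_1050)
```

As written, spacer lengths 20 and 50 belong to neither half of their respective pairs, which is worth keeping in mind when comparing the paired results.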

In [None]:
# Counting
count_internal_indel = dict()

count_internal_indel['STR'] = count_internal_indel_pos(pos_expand_all['STR'].copy(), useful_cols = ['Strand', 'repeat', 'category', 'status', 'length'], combine_LR = True)

count_internal_indel['IR'] = count_internal_indel_pos(pos_expand_all['IR'], variants_indel_slim_AC, useful_cols = ['category', '#MM', 'GC%_stem', 'stem_len', 'spacer'], combine_LR = True)
count_internal_indel['MR'] = count_internal_indel_pos(pos_expand_all['MR'], variants_indel_slim_AC, useful_cols = ['category', '#MM', 'GC%_stem', 'purine', 'stem_len', 'spacer'], combine_LR = True)
count_internal_indel['DR'] = count_internal_indel_pos(pos_expand_all['DR'], variants_indel_slim_AC, useful_cols = ['category', '#MM', 'GC%_stem', 'stem_len', 'spacer'], combine_LR = True)

count_internal_indel['ZDNA'] = count_internal_indel_pos(pos_expand_all['ZDNA'], variants_indel_slim_AC, useful_cols = ['category', 'length'], combine_LR = True)
count_internal_indel['ZDNA_GY'] = count_internal_indel_pos(pos_expand_all['ZDNA_GY'], variants_indel_slim_AC, useful_cols = ['Strand', 'category', 'length'], combine_LR = True)

count_internal_indel['G4'] = count_internal_indel_pos(pos_expand_all['G4'], variants_indel_slim_AC, useful_cols = ['Strand', 'status', 'category', 'length'], combine_LR = True)

In [None]:
# Normalization
norm_internal_indel_ins = dict(); norm_internal_indel_del = dict()

norm_internal_indel_ins['STR'] = mut_norm_conf(count_internal_resum(count_internal_indel['STR'], useful_cols= ['repeat', 'category', 'status', 'length'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['STR'] = mut_norm_conf(count_internal_resum(count_internal_indel['STR'], useful_cols= ['repeat', 'category', 'status', 'length'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

In [None]:
norm_internal_indel_ins['DR_stemlen'] = mut_norm_conf(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, False]), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['DR_stemlen'] = mut_norm_conf(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, False]), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

# Loop (spacer) length restricted to 0-10 nt
norm_internal_indel_ins['DR_stemlen_loop10'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['DR_stemlen_loop10'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

# Long vs. short insertions
count_internal_indel['DR_longins'] = count_internal_indel_pos(pos_expand_all['DR'], variants_ins_slim_AC_long, useful_cols = ['category', '#MM', 'GC%_stem', 'stem_len', 'spacer'], combine_LR = True)
count_internal_indel['DR_shortins'] = count_internal_indel_pos(pos_expand_all['DR'], variants_ins_slim_AC_short, useful_cols = ['category', '#MM', 'GC%_stem', 'stem_len', 'spacer'], combine_LR = True)

norm_internal_indel_ins['DR_longins_stemlen_loop10'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR_longins'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_ins_count_AC_freq_long, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins_long'], do_binconf = True)
norm_internal_indel_ins['DR_shortins_stemlen_loop10'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR_shortins'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_ins_count_AC_freq_short, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins_short'], do_binconf = True)

# combine loop edge and middle for main figure
norm_internal_indel_ins['DR_longins_stemlen_loop10_loopcombined'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR_longins'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[['spacer_pos', 'spacer_middle_pos'], 1, list(range(11)), False]), useful_cols= ['#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_ins_count_AC_freq_long, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins_long'], do_binconf = True)
norm_internal_indel_del['DR_stemlen_loop10_loopcombined'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[['spacer_pos', 'spacer_middle_pos'], 1, list(range(11)), False]), useful_cols= ['#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

# Loop (spacer) length restrictions above 10 nt; range(a, b) excludes b, so e.g. range(11, 50) selects spacers of 11-49 nt

norm_internal_indel_ins['DR_stemlen_loopmorethan10'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11,100)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['DR_stemlen_loopmorethan10'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11,100)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

norm_internal_indel_ins['DR_stemlen_loop1050'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11,50)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['DR_stemlen_loop1050'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11,50)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

norm_internal_indel_ins['DR_stemlen_loop50100'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(51,100)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['DR_stemlen_loop50100'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(51,100)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

norm_internal_indel_ins['DR_stemlen_loop1020'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11,20)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['DR_stemlen_loop1020'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11,20)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

norm_internal_indel_ins['DR_stemlen_loop20100'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(21,100)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['DR_stemlen_loop20100'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(21,100)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

norm_internal_indel_ins['DR_stemlen_loop04'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(5)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['DR_stemlen_loop04'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(5)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

norm_internal_indel_ins['DR_stemlen_loop510'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(5,10)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['DR_stemlen_loop510'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['DR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(5,10)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

In [None]:
norm_internal_indel_ins['IR_stemlen'] = mut_norm_conf(count_internal_resum(count_internal_indel['IR'], useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, False]), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['IR_stemlen'] = mut_norm_conf(count_internal_resum(count_internal_indel['IR'], useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, False]), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

norm_internal_indel_ins['IR_stemlen_loop10'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['IR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['IR_stemlen_loop10'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['IR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

norm_internal_indel_ins['IR_stemlen_loop04'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['IR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(5)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['IR_stemlen_loop04'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['IR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(5)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

norm_internal_indel_ins['IR_stemlen_loop510'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['IR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(5,10)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['IR_stemlen_loop510'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['IR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(5,10)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

norm_internal_indel_ins['IR_stemlen_loop1020'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['IR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11,20)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['IR_stemlen_loop1020'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['IR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11,20)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

norm_internal_indel_ins['IR_stemlen_loop20100'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['IR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(21,100)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['IR_stemlen_loop20100'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['IR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(21,100)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

In [None]:
norm_internal_indel_ins['MR_stemlen'] = mut_norm_conf(count_internal_resum(count_internal_indel['MR'], useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, False]), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['MR_stemlen'] = mut_norm_conf(count_internal_resum(count_internal_indel['MR'], useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, False]), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

norm_internal_indel_ins['MR_stemlen_loop10'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['MR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['MR_stemlen_loop10'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['MR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

norm_internal_indel_ins['MR_stemlen_loop04'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['MR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(5)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['MR_stemlen_loop04'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['MR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(5)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

norm_internal_indel_ins['MR_stemlen_loop510'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['MR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(5,10)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['MR_stemlen_loop510'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['MR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(5,10)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

norm_internal_indel_ins['MR_stemlen_loop1020'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['MR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11,20)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['MR_stemlen_loop1020'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['MR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(11,20)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

norm_internal_indel_ins['MR_stemlen_loop20100'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['MR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(21,100)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['MR_stemlen_loop20100'] = mut_norm_conf(count_internal_resum(count_internal_resum(count_internal_indel['MR'], useful_cols= ['category', '#MM', 'spacer', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel, selections=[False, 1, list(range(21,100)), False]), useful_cols= ['category', '#MM', 'stem_len'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

In [None]:
norm_internal_indel_ins['ZDNA'] = mut_norm_conf(count_internal_resum(count_internal_indel['ZDNA'], useful_cols= ['category', 'length'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['ZDNA'] = mut_norm_conf(count_internal_resum(count_internal_indel['ZDNA'], useful_cols= ['category', 'length'], gc_correct=False, tri_subset = triplet_mutations_und_indel), snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)
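
The loop-length keys in the DR, IR and MR cells above are built from Python `range` objects, whose upper bound is exclusive. The sketch below copies the `range` arguments from those `selections` lists and reports which spacer lengths each bin actually selects. The helper names `covered` and `gaps` are illustrative and not part of the notebook. Whether the boundary lengths it reports as unselected are excluded on purpose is not stated in this section.

```python
# Spacer (loop) length bins, copied from the `selections` arguments above.
loop_bins = {
    "loop04": range(5),
    "loop510": range(5, 10),
    "loop10": range(11),
    "loop1020": range(11, 20),
    "loop1050": range(11, 50),
    "loop20100": range(21, 100),
    "loop50100": range(51, 100),
}


def covered(bin_names):
    """Return the set of spacer lengths selected by the named bins."""
    return set().union(*(loop_bins[name] for name in bin_names))


def gaps(bin_names, lo=0, hi=99):
    """Return spacer lengths in [lo, hi] that none of the named bins select."""
    return sorted(set(range(lo, hi + 1)) - covered(bin_names))


# First and last spacer length selected by each bin.
for name, spacers in loop_bins.items():
    print(f"{name}: {spacers[0]}-{spacers[-1]}")

fine_bins = ["loop04", "loop510", "loop1020", "loop20100"]
coarse_bins = ["loop10", "loop1050", "loop50100"]
print("not selected by the fine bins:", gaps(fine_bins))
print("not selected by the coarse bins:", gaps(coarse_bins))
```

A check of this kind makes it easy to see whether a key such as `loop510` includes a spacer of length 10 before the bins are compared against each other.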

#### Counting and analysis of indels - G4s <a name="mutation_internal_indels_analysis_G4"></a>

[Return to Table of Contents](#TOC)

In [None]:
count_internal_indel['G4_K+'] = count_internal_resum(count_internal_indel['G4'], useful_cols= ['status', 'category'], selections = [['K+', 'both'], False], gc_correct=False, tri_subset = triplet_mutations_und_indel)

norm_internal_indel_ins['G4'] = mut_norm_conf(count_internal_indel['G4_K+'], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_ins, normtorandom = True, random_normaverage = normtorandom_indel['ins'], do_binconf = True)
norm_internal_indel_del['G4'] = mut_norm_conf(count_internal_indel['G4_K+'], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, tri_subset = triplet_mutations_und_del, normtorandom = True, random_normaverage = normtorandom_indel['del'], do_binconf = True)

In [None]:
# Normalization by position
norm_internal_indel_G4_positions_ins = dict()
norm_internal_indel_G4_positions_ins["run 5'"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['run_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_ins if (mut[0] != 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['ins'], gc_correct=False)
norm_internal_indel_G4_positions_ins["run 3'"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['run_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_ins if (mut[2] != 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['ins'], gc_correct=False)
norm_internal_indel_G4_positions_ins["loop 5'"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['loop_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_ins if (mut[2] != 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['ins'], gc_correct=False)
norm_internal_indel_G4_positions_ins["loop 3'"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['loop_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_ins if (mut[0] != 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['ins'], gc_correct=False)
norm_internal_indel_G4_positions_ins["loop 1nt"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['loop_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_ins if (mut[0] == 'G') and (mut[2] == 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['ins'], gc_correct=False)
norm_internal_indel_G4_positions_ins['run_positions_middle'] =  mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['run_positions_middle'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = triplet_mutations_und_ins, snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['ins'], gc_correct=False)
norm_internal_indel_G4_positions_ins['loop_positions_middle'] =  mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['loop_positions_middle'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = triplet_mutations_und_ins, snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['ins'], gc_correct=False)

norm_internal_indel_G4_positions_del = dict()
norm_internal_indel_G4_positions_del["run 5'"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['run_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_del if (mut[0] != 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['del'], gc_correct=False)
norm_internal_indel_G4_positions_del["run 3'"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['run_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_del if (mut[2] != 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['del'], gc_correct=False)
norm_internal_indel_G4_positions_del["loop 5'"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['loop_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_del if (mut[2] != 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['del'], gc_correct=False)
norm_internal_indel_G4_positions_del["loop 3'"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['loop_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_del if (mut[0] != 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['del'], gc_correct=False)
norm_internal_indel_G4_positions_del["loop 1nt"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['loop_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_del if (mut[0] == 'G') and (mut[2] == 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['del'], gc_correct=False)
norm_internal_indel_G4_positions_del['run_positions_middle'] =  mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['run_positions_middle'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = triplet_mutations_und_del, snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['del'], gc_correct=False)
norm_internal_indel_G4_positions_del['loop_positions_middle'] =  mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['loop_positions_middle'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = triplet_mutations_und_del, snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['del'], gc_correct=False)
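
The position-specific entries above reuse the same edge counts and differ only in the list comprehension passed as `tri_subset`, which keeps or drops a mutation depending on whether its flanking bases are G. The sketch below applies the same three conditions to a handful of made-up trinucleotide strings. It assumes only that index 0 is the 5' flanking base and index 2 is the 3' flanking base, as the indexing in the notebook implies; the real entries of `triplet_mutations_und_ins` and `triplet_mutations_und_del` may carry more information than a bare triplet.

```python
# Hypothetical trinucleotide contexts, used only to illustrate the filters.
triplets = ["AGG", "GGA", "GGG", "GAG", "GAT", "TAG", "TAT"]

# The three conditions used in the tri_subset list comprehensions above.
filters = {
    "5' flank is not G": lambda mut: mut[0] != "G",  # "run 5'" and "loop 3'"
    "3' flank is not G": lambda mut: mut[2] != "G",  # "run 3'" and "loop 5'"
    "both flanks are G": lambda mut: mut[0] == "G" and mut[2] == "G",  # "loop 1nt"
}

# Apply every filter to the same list of contexts.
selected = {
    label: [mut for mut in triplets if keep(mut)]
    for label, keep in filters.items()
}

for label, muts in selected.items():
    print(f"{label}: {muts}")

# Contexts that pass both single-flank filters (neither flank is G).
in_both = sorted(set(selected["5' flank is not G"]) & set(selected["3' flank is not G"]))
print("pass both single-flank filters:", in_both)
```

In this toy list a context with no flanking G passes both single-flank filters, so the two subsets are not mutually exclusive by construction; how often such contexts occur at G4 edge positions depends on the underlying counts.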

In [None]:
# Reorganize data structure
G4_positionorder = ["run 5'", "run_positions_middle", "run 3'", "loop 5'", "loop_positions_middle",  "loop 3'", "loop 1nt"]
# One DataFrame per element (0, 1, 2) of the mut_norm_conf output: rows are G4 positions, columns are QC cutoffs
G4_norm_summary_ins = {i: pd.DataFrame() for i in range(3)}
for QC_cutoff in vqslod_list_indel:
    for i in G4_norm_summary_ins:
        G4_norm_summary_ins[i][QC_cutoff] = pd.Series(list(pd.concat([norm_internal_indel_G4_positions_ins[category][i][QC_cutoff] for category in norm_internal_indel_G4_positions_ins])), index = list(norm_internal_indel_G4_positions_ins)).reindex(G4_positionorder)

# Same reshaping for deletions; G4_positionorder is defined above
G4_norm_summary_del = {i: pd.DataFrame() for i in range(3)}
for QC_cutoff in vqslod_list_indel:
    for i in G4_norm_summary_del:
        G4_norm_summary_del[i][QC_cutoff] = pd.Series(list(pd.concat([norm_internal_indel_G4_positions_del[category][i][QC_cutoff] for category in norm_internal_indel_G4_positions_del])), index = list(norm_internal_indel_G4_positions_del)).reindex(G4_positionorder)
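
The reorganization above turns a dictionary keyed by G4 position into a table with positions as rows, in the fixed order given by `G4_positionorder`, and QC cutoffs as columns. The plain-Python sketch below shows the same reshaping on a toy input, without pandas. The cutoff values and numbers are invented for illustration, and `summarize` is a hypothetical helper, not a function from the notebook.

```python
# Desired row order (a shortened version of G4_positionorder).
position_order = ["run 5'", "run_positions_middle", "run 3'"]

# Toy stand-in for one element of the per-position results:
# {position label: {QC cutoff: value}}. All numbers are made up.
per_position = {
    "run 5'": {-2.0: 1.1, 0.0: 1.2},
    "run 3'": {-2.0: 0.9, 0.0: 1.0},
    "run_positions_middle": {-2.0: 1.5, 0.0: 1.6},
}


def summarize(results, order, cutoffs):
    """Return one row per label in `order` and one column per cutoff.

    Labels or cutoffs missing from `results` yield None, loosely mirroring
    the NaN that pandas `reindex` inserts for labels it cannot find.
    """
    return [[results.get(label, {}).get(cutoff) for cutoff in cutoffs]
            for label in order]


table = summarize(per_position, position_order, cutoffs=[-2.0, 0.0])
for label, row in zip(position_order, table):
    print(label, row)
```

The insertion order of the input dictionary does not affect the result, which is the purpose of the `.reindex(G4_positionorder)` call in the cell above.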

In [None]:
# Triplet-level data for G4 positions
norm_internal_indel_G4_positions_tri_ins = dict()
norm_internal_indel_G4_positions_tri_ins["run 5'"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['run_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_ins if (mut[0] != 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['ins'], gc_correct=False, output_div=True)
norm_internal_indel_G4_positions_tri_ins["run 3'"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['run_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_ins if (mut[2] != 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['ins'], gc_correct=False, output_div=True)
norm_internal_indel_G4_positions_tri_ins["loop 5'"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['loop_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_ins if (mut[2] != 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['ins'], gc_correct=False, output_div=True)
norm_internal_indel_G4_positions_tri_ins["loop 3'"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['loop_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_ins if (mut[0] != 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['ins'], gc_correct=False, output_div=True)
norm_internal_indel_G4_positions_tri_ins["loop 1nt"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['loop_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_ins if (mut[0] == 'G') and (mut[2] == 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['ins'], gc_correct=False, output_div=True)
norm_internal_indel_G4_positions_tri_ins['run_positions_middle'] =  mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['run_positions_middle'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = triplet_mutations_und_ins, snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['ins'], gc_correct=False, output_div=True)
norm_internal_indel_G4_positions_tri_ins['loop_positions_middle'] =  mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['loop_positions_middle'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = triplet_mutations_und_ins, snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['ins'], gc_correct=False, output_div=True)

norm_internal_indel_G4_positions_tri_del = dict()
norm_internal_indel_G4_positions_tri_del["run 5'"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['run_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_del if (mut[0] != 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['del'], gc_correct=False, output_div=True)
norm_internal_indel_G4_positions_tri_del["run 3'"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['run_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_del if (mut[2] != 'G')],snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['del'], gc_correct=False, output_div=True)
norm_internal_indel_G4_positions_tri_del["loop 5'"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['loop_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_del if (mut[2] != 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['del'], gc_correct=False, output_div=True)
norm_internal_indel_G4_positions_tri_del["loop 3'"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['loop_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_del if (mut[0] != 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['del'], gc_correct=False, output_div=True)
norm_internal_indel_G4_positions_tri_del["loop 1nt"] = mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['loop_positions_edge'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = [mut for mut in triplet_mutations_und_del if (mut[0] == 'G') & (mut[2] == 'G')], snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['del'], gc_correct=False, output_div=True)
norm_internal_indel_G4_positions_tri_del['run_positions_middle'] =  mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['run_positions_middle'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = triplet_mutations_und_del, snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['del'], gc_correct=False, output_div=True)
norm_internal_indel_G4_positions_tri_del['loop_positions_middle'] =  mut_norm_conf(count_internal_resum(count_internal_indel['G4_K+'], ['category'], selections = ['loop_positions_middle'], gc_correct=False, tri_subset = triplet_mutations_und_indel), tri_subset = triplet_mutations_und_del, snvindel = 'indel', min_count = 10, genome_AC_freq_current = variants_indel_count_AC_freq_all, normtorandom = True, random_normaverage=normtorandom_indel['del'], gc_correct=False, output_div=True)
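
The `tri_subset` filters in the cell above select trinucleotide contexts by the identity of the 5' (`mut[0]`) and 3' (`mut[2]`) flanking bases. The following is a minimal sketch of that selection logic only. It assumes each context is a plain three-letter string, which may not match the actual element format of `triplet_mutations_und_ins` and `triplet_mutations_und_del` (defined elsewhere in this notebook). `toy_contexts` and `split_edge_contexts` are hypothetical names introduced for illustration.

```python
from itertools import product

# Hypothetical stand-in for the triplet list: all 64 trinucleotide contexts.
toy_contexts = [''.join(p) for p in product('ACGT', repeat=3)]

def split_edge_contexts(contexts):
    """Group contexts by whether the 5' and/or 3' flanking base is a G."""
    return {
        "5' flank not G": [c for c in contexts if c[0] != 'G'],
        "3' flank not G": [c for c in contexts if c[2] != 'G'],
        "both flanks G": [c for c in contexts if (c[0] == 'G') and (c[2] == 'G')],
    }

groups = split_edge_contexts(toy_contexts)
for name, members in groups.items():
    print(name, len(members))
```

The first two groups overlap (a context with neither flank equal to G falls in both), so they should not be read as a partition. The sketch uses `and` where the cell above uses `&`; on plain booleans the two give the same result.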

In [None]:
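# Reorganize data structure: for each QC cutoff, build one table with one column per
# G4 position category, divided by the matching normtorandom_indel value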
G4_norm_summary_ins_tri = dict()
for QC_cutoff in vqslod_list_indel:
    G4_norm_summary_ins_tri[QC_cutoff] = pd.concat([norm_internal_indel_G4_positions_tri_ins[category][0][QC_cutoff] for category in norm_internal_indel_G4_positions_tri_ins], axis=1) / normtorandom_indel['ins'][QC_cutoff]
    G4_norm_summary_ins_tri[QC_cutoff].columns = list(norm_internal_indel_G4_positions_tri_ins)

G4_norm_summary_del_tri = dict()
for QC_cutoff in vqslod_list_indel:
    G4_norm_summary_del_tri[QC_cutoff] = pd.concat([norm_internal_indel_G4_positions_tri_del[category][0][QC_cutoff] for category in norm_internal_indel_G4_positions_tri_del], axis=1) / normtorandom_indel['del'][QC_cutoff]
    G4_norm_summary_del_tri[QC_cutoff].columns = list(norm_internal_indel_G4_positions_tri_del)

G4_norm_summary_indel_tri = dict()
for QC_cutoff in vqslod_list_indel:
    G4_norm_summary_indel_tri[QC_cutoff] = pd.concat([G4_norm_summary_ins_tri[QC_cutoff], G4_norm_summary_del_tri[QC_cutoff]])
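
The reorganization above relies on two `pd.concat` patterns: `axis=1` to place per-category results side by side, and the default `axis=0` to stack the insertion and deletion tables. Below is a small sketch on toy data. It assumes each per-category result is a `pandas.Series` and that the normalization value is a scalar, neither of which is confirmed by this section. `toy_rates` and `toy_random_average` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins: one Series of rates per category, and one scalar per QC cutoff.
toy_rates = {
    "run 5'": pd.Series({'AGT': 0.2, 'CGA': 0.4}),
    "run 3'": pd.Series({'AGT': 0.6, 'CGA': 0.8}),
}
toy_random_average = 2.0

# Place the per-category Series side by side, then rescale every value by the scalar.
summary = pd.concat([toy_rates[category] for category in toy_rates], axis=1) / toy_random_average
summary.columns = list(toy_rates)

# Stacking two such tables row-wise mirrors the combined ins + del table.
stacked = pd.concat([summary, summary])
print(summary)
print(stacked.shape)
```

Because the Series are unnamed, the concatenated columns are initially labelled 0, 1, and so on, which is why the column names are assigned explicitly afterwards.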

#### Save/load internal indel counts <a name="mutation_internal_indels_saveload"></a>

[Return to Table of Contents](#TOC)

In [None]:
# Save temporary output of the mutation counts
with open('./analysis/temp/mut_internal_indel_counts_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(count_internal_indel, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Save temporary output of the normalized mutation counts
with open('./analysis/temp/mut_internal_ins_norm_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(norm_internal_indel_ins, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('./analysis/temp/mut_internal_del_norm_chr'+str(chr_range)+'-22.pickle', 'wb') as handle:
    pickle.dump(norm_internal_indel_del, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Load temporary output of the mutation counts
with open('./analysis/temp/mut_internal_indel_counts_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    count_internal_indel = pickle.load(handle)

# Load temporary output of the normalized mutation counts
with open('./analysis/temp/mut_internal_ins_norm_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    norm_internal_indel_ins = pickle.load(handle)
with open('./analysis/temp/mut_internal_del_norm_chr'+str(chr_range)+'-22.pickle', 'rb') as handle:
    norm_internal_indel_del = pickle.load(handle)
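
The save and load cells above repeat the same open/dump and open/load pattern with a shared file-name scheme. One possible way to factor this out is sketched below. It assumes `chr_range` is a single starting chromosome value, which is not confirmed here. `temp_pickle_path`, `save_pickle`, and `load_pickle` are hypothetical helper names, and the round trip runs in a throwaway directory rather than `./analysis/temp/`.

```python
import os
import pickle
import tempfile

def temp_pickle_path(base_dir, stem, chr_start):
    """Build a file name following the '<stem>_chr<start>-22.pickle' pattern."""
    return os.path.join(base_dir, stem + '_chr' + str(chr_start) + '-22.pickle')

def save_pickle(obj, path):
    """Write obj to path using the highest available pickle protocol."""
    with open(path, 'wb') as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)

def load_pickle(path):
    """Read and return the object stored at path."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)

# Round-trip a toy object in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_counts = {'G4_K+': [1, 2, 3]}
    path = temp_pickle_path(tmp_dir, 'mut_internal_indel_counts', 1)
    save_pickle(toy_counts, path)
    restored = load_pickle(path)

print(os.path.basename(path), restored == toy_counts)
```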

### Plots  <a name="mutation_internal_indels_plots"></a>

[Return to Table of Contents](#TOC)

In [None]:
# Assign colors for plots
QC_colors_internal_indel = make_colorscale(vqslod_list_indel, 0.5)
QC_colors_internal_indel = pd.DataFrame(QC_colors_internal_indel).transpose()
QC_colors_internal_indel['name'] = ['no QC', 'pass', 'VQSLOD >0', 'VQSLOD >1.4']

#### Plot for Fig. 4a - indels within direct repeats  <a name="mutation_internal_indels_plots_fig4A"></a>

[Return to Table of Contents](#TOC)

In [None]:
counter = 0
inframe_mut_fig4a = make_subplots(rows=1, cols=1, shared_yaxes=True, shared_xaxes = True, vertical_spacing = 0.030, horizontal_spacing = 0.02, subplot_titles = ['Insertions >5 nt'])
counter = 1  # row index of the current subplot row
inframe_mut_fig4a.update_yaxes(title = dict(text = 'DR', font = dict(size = 16)), row = counter, col = 1)
plot_internal_add('DR_longins_stemlen_loop10', [10,20], ['flank_pos', 1], counter, 1, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_fig4a, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
inframe_mut_fig4a.update_yaxes(zeroline = False, range = [-0.1,2.6], type = 'log', dtick = 1)
inframe_mut_fig4a.update_xaxes(title = dict(text = 'motif length', font = dict(size = 14), standoff = 0), row = 1, col = 1)
inframe_mut_fig4a.update_layout(width = 250, height = 200, margin = dict(l = 45, r = 25, b = 40, t = 30), legend=dict(y = -0.25, x = 0.175, orientation='h'))

inframe_mut_fig4a.show()

In [None]:
# Export at 10x scale; engine = 'orca' requires the external orca executable to be installed
inframe_mut_fig4a.write_image('./plots/revision_DR_indelfreq_simpler_fig4a.png', format='png', scale = 10, engine = 'orca')

#### Plot for Fig. S4b - DRs  <a name="mutation_internal_indels_plots_figS4B"></a>

[Return to Table of Contents](#TOC)

In [None]:
counter = 0
inframe_mut_figS4b = make_subplots(rows=6, cols=10, shared_yaxes=True, shared_xaxes = True, vertical_spacing = 0.030, horizontal_spacing = 0.02, subplot_titles = ['Flank', 'Within Motif', 'MM pos.', 'Sp. ends', 'Sp. mid', 'Flank', 'Within Motif', 'MM pos.', 'Sp. ends', 'Sp. mid'])
counter = 1  # row index of the current subplot row
plot_internal_add('DR_stemlen_loop04', [10,20], ['flank_pos', 1], counter, 1, norm_internal_indel_ins, showleg = True if counter == 1 else False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop04', [10,20], ['motif_pos_genome_middle', 1], counter, 2, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop04', [10,20], ['MM_pos', 1], counter, 3, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop04', [10,20], ['spacer_pos', 1], counter, 4, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop04', [10,20], ['spacer_middle_pos', 1], counter, 5, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop04', [10,20], ['flank_pos', 1], counter, 6, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop04', [10,20], ['motif_pos_genome_middle', 1], counter, 7, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop04', [10,20], ['MM_pos', 1], counter, 8, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop04', [10,20], ['spacer_pos', 1], counter, 9, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop04', [10,20], ['spacer_middle_pos', 1], counter, 10, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
inframe_mut_figS4b.update_yaxes(title = dict(text = 'Spacer 0-4'), row = counter, col = 1)
counter +=1
plot_internal_add('DR_stemlen_loop510', [10,20], ['flank_pos', 1], counter, 1, norm_internal_indel_ins, showleg = True if counter == 1 else False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop510', [10,20], ['motif_pos_genome_middle', 1], counter, 2, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop510', [10,20], ['MM_pos', 1], counter, 3, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop510', [10,20], ['spacer_pos', 1], counter, 4, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop510', [10,20], ['spacer_middle_pos', 1], counter, 5, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop510', [10,20], ['flank_pos', 1], counter, 6, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop510', [10,20], ['motif_pos_genome_middle', 1], counter, 7, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop510', [10,20], ['MM_pos', 1], counter, 8, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop510', [10,20], ['spacer_pos', 1], counter, 9, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop510', [10,20], ['spacer_middle_pos', 1], counter, 10, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
inframe_mut_figS4b.update_yaxes(title = dict(text = 'Spacer 5-10'), row = counter, col = 1)
counter +=1
plot_internal_add('DR_stemlen_loop1020', [10,20], ['flank_pos', 1], counter, 1, norm_internal_indel_ins, showleg = True if counter == 1 else False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop1020', [10,20], ['motif_pos_genome_middle', 1], counter, 2, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop1020', [10,20], ['MM_pos', 1], counter, 3, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop1020', [10,20], ['spacer_pos', 1], counter, 4, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop1020', [10,20], ['spacer_middle_pos', 1], counter, 5, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop1020', [10,20], ['flank_pos', 1], counter, 6, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop1020', [10,20], ['motif_pos_genome_middle', 1], counter, 7, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop1020', [10,20], ['MM_pos', 1], counter, 8, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop1020', [10,20], ['spacer_pos', 1], counter, 9, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop1020', [10,20], ['spacer_middle_pos', 1], counter, 10, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
inframe_mut_figS4b.update_yaxes(title = dict(text = 'Spacer 11-20'), row = counter, col = 1)
counter +=1
plot_internal_add('DR_stemlen_loop20100', [10,20], ['flank_pos', 1], counter, 1, norm_internal_indel_ins, showleg = True if counter == 1 else False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop20100', [10,20], ['motif_pos_genome_middle', 1], counter, 2, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop20100', [10,20], ['MM_pos', 1], counter, 3, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop20100', [10,20], ['spacer_pos', 1], counter, 4, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop20100', [10,20], ['spacer_middle_pos', 1], counter, 5, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop20100', [10,20], ['flank_pos', 1], counter, 6, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop20100', [10,20], ['motif_pos_genome_middle', 1], counter, 7, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop20100', [10,20], ['MM_pos', 1], counter, 8, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop20100', [10,20], ['spacer_pos', 1], counter, 9, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_stemlen_loop20100', [10,20], ['spacer_middle_pos', 1], counter, 10, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
inframe_mut_figS4b.update_yaxes(title = dict(text = 'Spacer >20'), row = counter, col = 1)
counter +=1
plot_internal_add('DR_longins_stemlen_loop10', [10,20], ['flank_pos', 1], counter, 1, norm_internal_indel_ins, showleg = True if counter == 1 else False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_longins_stemlen_loop10', [10,20], ['motif_pos_genome_middle', 1], counter, 2, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_longins_stemlen_loop10', [10,20], ['MM_pos', 1], counter, 3, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_longins_stemlen_loop10', [10,20], ['spacer_pos', 1], counter, 4, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_longins_stemlen_loop10', [10,20], ['spacer_middle_pos', 1], counter, 5, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
inframe_mut_figS4b.update_yaxes(title = dict(text = 'Ins. len >5'), row = counter, col = 1)
counter +=1
plot_internal_add('DR_shortins_stemlen_loop10', [10,20], ['flank_pos', 1], counter, 1, norm_internal_indel_ins, showleg = True if counter == 1 else False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_shortins_stemlen_loop10', [10,20], ['motif_pos_genome_middle', 1], counter, 2, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_shortins_stemlen_loop10', [10,20], ['MM_pos', 1], counter, 3, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_shortins_stemlen_loop10', [10,20], ['spacer_pos', 1], counter, 4, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('DR_shortins_stemlen_loop10', [10,20], ['spacer_middle_pos', 1], counter, 5, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS4b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
inframe_mut_figS4b.update_yaxes(title = dict(text = 'Ins. len <=5'), row = counter, col = 1)
inframe_mut_figS4b.update_yaxes(zeroline = False, range = [-1.1,2.6], type = 'log', dtick = 1)
inframe_mut_figS4b.add_shape(type='line', x0=0.5, x1=0.5, y0=-0.25, y1=1.05, xref='paper', yref='paper', line=dict(color='rgb(150,150,150)', width = 3))
inframe_mut_figS4b.update_layout(title = dict(text = 'Insertions                                                               Deletions', x = 0.225, font = dict(size = 18)))
inframe_mut_figS4b.update_xaxes(title = dict(text = 'Motif length', font = dict(size = 14), standoff = 0), row = 6, col = 1)
inframe_mut_figS4b.update_layout(xaxis36_showticklabels=True, xaxis37_showticklabels=True, xaxis38_showticklabels=True, xaxis39_showticklabels=True, xaxis40_showticklabels=True)
inframe_mut_figS4b.update_layout(width = 1000, height = 800, margin = dict(l = 45, r = 25, b = 35, t = 50), legend=dict(y = 0.32, x = 0.52, orientation='h'))

inframe_mut_figS4b.show()
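
The Fig. S4b cell above writes out one `plot_internal_add` call per panel. The panel layout is regular: four spacer-length rows with five insertion and five deletion columns, followed by two insertion-only rows. The calls could therefore be generated from a short specification. The sketch below only enumerates the panel coordinates and keys taken from that cell. It does not call `plot_internal_add`, and `panel_specs` is a hypothetical helper.

```python
# Position keys, in the column order used in the cell above.
position_keys = ['flank_pos', 'motif_pos_genome_middle', 'MM_pos', 'spacer_pos', 'spacer_middle_pos']
# Rows plotted for both insertions (columns 1-5) and deletions (columns 6-10).
both_types_rows = ['DR_stemlen_loop04', 'DR_stemlen_loop510', 'DR_stemlen_loop1020', 'DR_stemlen_loop20100']
# Rows plotted for insertions only (columns 1-5).
insertion_only_rows = ['DR_longins_stemlen_loop10', 'DR_shortins_stemlen_loop10']

def panel_specs():
    """Return (row, col, motif_key, position_key, dataset_label) for every panel."""
    specs = []
    row = 0
    for motif_key in both_types_rows:
        row += 1
        for offset, dataset in enumerate(['ins', 'del']):
            for i, position_key in enumerate(position_keys):
                col = offset * len(position_keys) + i + 1
                specs.append((row, col, motif_key, position_key, dataset))
    for motif_key in insertion_only_rows:
        row += 1
        for i, position_key in enumerate(position_keys):
            specs.append((row, i + 1, motif_key, position_key, 'ins'))
    return specs

specs = panel_specs()
print(len(specs))
```

Each tuple could then drive a single call inside a loop, choosing `norm_internal_indel_ins` or `norm_internal_indel_del` from the dataset label and setting `showleg` only for row 1, column 1.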

In [None]:
inframe_mut_figS4b.write_image('./plots/revision_DR_indelfreq_figS4b.png', format='png', scale = 10, engine = 'orca')

#### Plot for Fig. S5b - IRs  <a name="mutation_internal_indels_plots_figS5B"></a>

[Return to Table of Contents](#TOC)

In [None]:
counter = 0
inframe_mut_figS5b = make_subplots(rows=4, cols=10, shared_yaxes=True, shared_xaxes = True, vertical_spacing = 0.030, horizontal_spacing = 0.02, subplot_titles = ['Flank', 'Within Motif', 'MM pos.', 'Sp. ends', 'Sp. mid', 'Flank', 'Within Motif', 'MM pos.', 'Sp. ends', 'Sp. mid'])
counter = 1  # row index of the current subplot row
plot_internal_add('IR_stemlen_loop04', [10,20], ['flank_pos', 1], counter, 1, norm_internal_indel_ins, showleg = True if counter == 1 else False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop04', [10,20], ['motif_pos_genome_middle', 1], counter, 2, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop04', [10,20], ['MM_pos', 1], counter, 3, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop04', [10,20], ['spacer_pos', 1], counter, 4, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop04', [10,20], ['spacer_middle_pos', 1], counter, 5, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop04', [10,20], ['flank_pos', 1], counter, 6, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop04', [10,20], ['motif_pos_genome_middle', 1], counter, 7, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop04', [10,20], ['MM_pos', 1], counter, 8, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop04', [10,20], ['spacer_pos', 1], counter, 9, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop04', [10,20], ['spacer_middle_pos', 1], counter, 10, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
inframe_mut_figS5b.update_yaxes(title = dict(text = 'Spacer 0-4'), row = counter, col = 1)
counter +=1
plot_internal_add('IR_stemlen_loop510', [10,20], ['flank_pos', 1], counter, 1, norm_internal_indel_ins, showleg = True if counter == 1 else False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop510', [10,20], ['motif_pos_genome_middle', 1], counter, 2, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop510', [10,20], ['MM_pos', 1], counter, 3, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop510', [10,20], ['spacer_pos', 1], counter, 4, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop510', [10,20], ['spacer_middle_pos', 1], counter, 5, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop510', [10,20], ['flank_pos', 1], counter, 6, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop510', [10,20], ['motif_pos_genome_middle', 1], counter, 7, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop510', [10,20], ['MM_pos', 1], counter, 8, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop510', [10,20], ['spacer_pos', 1], counter, 9, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop510', [10,20], ['spacer_middle_pos', 1], counter, 10, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
inframe_mut_figS5b.update_yaxes(title = dict(text = 'Spacer 5-10'), row = counter, col = 1)
counter +=1
plot_internal_add('IR_stemlen_loop1020', [10,20], ['flank_pos', 1], counter, 1, norm_internal_indel_ins, showleg = True if counter == 1 else False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop1020', [10,20], ['motif_pos_genome_middle', 1], counter, 2, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop1020', [10,20], ['MM_pos', 1], counter, 3, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop1020', [10,20], ['spacer_pos', 1], counter, 4, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop1020', [10,20], ['spacer_middle_pos', 1], counter, 5, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop1020', [10,20], ['flank_pos', 1], counter, 6, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop1020', [10,20], ['motif_pos_genome_middle', 1], counter, 7, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop1020', [10,20], ['MM_pos', 1], counter, 8, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop1020', [10,20], ['spacer_pos', 1], counter, 9, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop1020', [10,20], ['spacer_middle_pos', 1], counter, 10, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
inframe_mut_figS5b.update_yaxes(title = dict(text = 'Spacer 11-20'), row = counter, col = 1)
counter +=1
plot_internal_add('IR_stemlen_loop20100', [10,20], ['flank_pos', 1], counter, 1, norm_internal_indel_ins, showleg = True if counter == 1 else False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop20100', [10,20], ['motif_pos_genome_middle', 1], counter, 2, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop20100', [10,20], ['MM_pos', 1], counter, 3, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop20100', [10,20], ['spacer_pos', 1], counter, 4, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop20100', [10,20], ['spacer_middle_pos', 1], counter, 5, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop20100', [10,20], ['flank_pos', 1], counter, 6, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop20100', [10,20], ['motif_pos_genome_middle', 1], counter, 7, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop20100', [10,20], ['MM_pos', 1], counter, 8, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop20100', [10,20], ['spacer_pos', 1], counter, 9, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('IR_stemlen_loop20100', [10,20], ['spacer_middle_pos', 1], counter, 10, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS5b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
inframe_mut_figS5b.update_yaxes(title = dict(text = 'Spacer >20'), row = counter, col = 1)

inframe_mut_figS5b.update_yaxes(zeroline = False, range = [-1.1,2.6], type = 'log', dtick = 1)
inframe_mut_figS5b.add_shape(type='line', x0=0.5, x1=0.5, y0=-0.04, y1=1.05, xref='paper', yref='paper', line=dict(color='rgb(150,150,150)', width = 3))
inframe_mut_figS5b.update_layout(title = dict(text = 'Insertions                                                               Deletions', x = 0.225, font = dict(size = 18)))
inframe_mut_figS5b.update_xaxes(title = dict(text = 'Motif length', font = dict(size = 14), standoff = 0), row = 4, col = 1)
inframe_mut_figS5b.update_layout(xaxis36_showticklabels=True, xaxis37_showticklabels=True, xaxis38_showticklabels=True, xaxis39_showticklabels=True, xaxis40_showticklabels=True)
inframe_mut_figS5b.update_layout(width = 1000, height = 550, margin = dict(l = 45, r = 25, b = 35, t = 50), legend=dict(y = -0.03, x = 0.52, orientation='h'))

inframe_mut_figS5b.show()

In [None]:
inframe_mut_figS5b.write_image('./plots/revision_IR_indelfreq_figS5b.png', format='png', scale = 10, engine = 'orca')
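
The `xaxis36` to `xaxis40` keys used above address individual subplot axes by number. As a minimal sketch, assuming Plotly numbers subplot axes in row-major order starting from the top-left cell (the helper name `axis_key` is hypothetical and is not part of Plotly or of this notebook), the key for a given grid cell could be computed like this:

```python
def axis_key(row, col, n_cols, axis="x"):
    """Return the layout key (e.g. 'xaxis36') for a 1-indexed subplot cell.

    Assumption: axes are numbered row by row from the top-left subplot,
    the first axis being 'xaxis' (no number) and later ones 'xaxis2', ...
    """
    index = (row - 1) * n_cols + col
    return f"{axis}axis" if index == 1 else f"{axis}axis{index}"


# Bottom row (row 4) of a 4 x 10 grid, deletion columns 6-10
bottom_row_keys = [axis_key(4, col, n_cols=10) for col in range(6, 11)]
print(bottom_row_keys)
```

Under that assumption, the same five settings could be built programmatically, for example with `{f'{key}_showticklabels': True for key in bottom_row_keys}` passed as keyword arguments to `update_layout`.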

#### Plot for Fig. S6b - MRs  <a name="mutation_internal_indels_plots_figS6B"></a>

[Return to Table of Contents](#TOC)

In [None]:
counter = 0
inframe_mut_figS6b = make_subplots(rows=4, cols=10, shared_yaxes=True, shared_xaxes = True, vertical_spacing = 0.025, horizontal_spacing = 0.025, subplot_titles = ['Flank', 'Within Motif', 'MM pos.', 'Sp. ends', 'Sp. mid', 'Flank', 'Within Motif', 'MM pos.', 'Sp. ends', 'Sp. mid'])
counter =1
plot_internal_add('MR_stemlen_loop04', [10,17], ['flank_pos', 1], counter, 1, norm_internal_indel_ins, showleg = True if counter == 1 else False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop04', [10,17], ['motif_pos_genome_middle', 1], counter, 2, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop04', [10,17], ['MM_pos', 1], counter, 3, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop04', [10,17], ['spacer_pos', 1], counter, 4, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop04', [10,17], ['spacer_middle_pos', 1], counter, 5, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop04', [10,17], ['flank_pos', 1], counter, 6, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop04', [10,17], ['motif_pos_genome_middle', 1], counter, 7, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop04', [10,17], ['MM_pos', 1], counter, 8, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop04', [10,17], ['spacer_pos', 1], counter, 9, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop04', [10,17], ['spacer_middle_pos', 1], counter, 10, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
inframe_mut_figS6b.update_yaxes(title = dict(text = 'Spacer 0-4'), row = counter, col = 1)
counter +=1
plot_internal_add('MR_stemlen_loop510', [10,17], ['flank_pos', 1], counter, 1, norm_internal_indel_ins, showleg = True if counter == 1 else False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop510', [10,17], ['motif_pos_genome_middle', 1], counter, 2, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop510', [10,17], ['MM_pos', 1], counter, 3, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop510', [10,17], ['spacer_pos', 1], counter, 4, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop510', [10,17], ['spacer_middle_pos', 1], counter, 5, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop510', [10,17], ['flank_pos', 1], counter, 6, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop510', [10,17], ['motif_pos_genome_middle', 1], counter, 7, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop510', [10,17], ['MM_pos', 1], counter, 8, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop510', [10,17], ['spacer_pos', 1], counter, 9, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop510', [10,17], ['spacer_middle_pos', 1], counter, 10, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
inframe_mut_figS6b.update_yaxes(title = dict(text = 'Spacer 5-10'), row = counter, col = 1)
counter +=1
plot_internal_add('MR_stemlen_loop1020', [10,17], ['flank_pos', 1], counter, 1, norm_internal_indel_ins, showleg = True if counter == 1 else False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop1020', [10,17], ['motif_pos_genome_middle', 1], counter, 2, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop1020', [10,17], ['MM_pos', 1], counter, 3, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop1020', [10,17], ['spacer_pos', 1], counter, 4, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop1020', [10,17], ['spacer_middle_pos', 1], counter, 5, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop1020', [10,17], ['flank_pos', 1], counter, 6, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop1020', [10,17], ['motif_pos_genome_middle', 1], counter, 7, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop1020', [10,17], ['MM_pos', 1], counter, 8, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop1020', [10,17], ['spacer_pos', 1], counter, 9, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop1020', [10,17], ['spacer_middle_pos', 1], counter, 10, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
inframe_mut_figS6b.update_yaxes(title = dict(text = 'Spacer 11-20'), row = counter, col = 1)
counter +=1
plot_internal_add('MR_stemlen_loop20100', [10,17], ['flank_pos', 1], counter, 1, norm_internal_indel_ins, showleg = True if counter == 1 else False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop20100', [10,17], ['motif_pos_genome_middle', 1], counter, 2, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop20100', [10,17], ['MM_pos', 1], counter, 3, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop20100', [10,17], ['spacer_pos', 1], counter, 4, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop20100', [10,17], ['spacer_middle_pos', 1], counter, 5, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop20100', [10,17], ['flank_pos', 1], counter, 6, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop20100', [10,17], ['motif_pos_genome_middle', 1], counter, 7, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop20100', [10,17], ['MM_pos', 1], counter, 8, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop20100', [10,17], ['spacer_pos', 1], counter, 9, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('MR_stemlen_loop20100', [10,17], ['spacer_middle_pos', 1], counter, 10, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS6b, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
inframe_mut_figS6b.update_yaxes(title = dict(text = 'Spacer >20'), row = counter, col = 1)

inframe_mut_figS6b.update_yaxes(zeroline = False, range = [-1.1,2.6], type = 'log', dtick = 1)
inframe_mut_figS6b.add_shape(type='line', x0=0.5, x1=0.5, y0=-0.04, y1=1.05, xref='paper', yref='paper', line=dict(color='rgb(150,150,150)', width = 3))
inframe_mut_figS6b.update_layout(title = dict(text = 'Insertions                                                               Deletions', x = 0.225, font = dict(size = 18)))
inframe_mut_figS6b.update_xaxes(dtick = 3)
inframe_mut_figS6b.update_xaxes(title = dict(text = 'Motif length', font = dict(size = 14), standoff = 0), row = 4, col = 1)
inframe_mut_figS6b.update_layout(xaxis36_showticklabels=True, xaxis37_showticklabels=True, xaxis38_showticklabels=True, xaxis39_showticklabels=True, xaxis40_showticklabels=True)
inframe_mut_figS6b.update_layout(width = 1023, height = 513, margin = dict(l = 45, r = 25, b = 35, t = 50), legend=dict(y = -0.04, x = 0.52, orientation='h'))

inframe_mut_figS6b.show()

In [None]:
inframe_mut_figS6b.write_image('./plots/revision_MR_indelfreq_figS6b.png', format='png', scale = 10, engine = 'orca')
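
The 40 explicit `plot_internal_add` calls above follow a regular pattern: four spacer bins (rows), five positions, and two mutation types (columns 1-5 for insertions, 6-10 for deletions). The sketch below only enumerates that grid; `panel_specs` and the dictionary keys are hypothetical names introduced for illustration, and the sketch does not call `plot_internal_add` itself.

```python
from itertools import product

SPACER_BINS = ["loop04", "loop510", "loop1020", "loop20100"]
POSITIONS = ["flank_pos", "motif_pos_genome_middle", "MM_pos",
             "spacer_pos", "spacer_middle_pos"]
MUT_TYPES = ["ins", "del"]  # columns 1-5 are insertions, 6-10 are deletions


def panel_specs(prefix):
    """Yield one dict per subplot, in the same order as the explicit calls."""
    for row, spacer in enumerate(SPACER_BINS, start=1):
        for (m_idx, mut), (p_idx, pos) in product(enumerate(MUT_TYPES),
                                                  enumerate(POSITIONS)):
            col = m_idx * len(POSITIONS) + p_idx + 1
            yield {
                "motif": f"{prefix}_stemlen_{spacer}",
                "position": pos,
                "row": row,
                "col": col,
                "mut_type": mut,
                # Only the very first panel contributes legend entries
                "showleg": row == 1 and col == 1,
            }


specs = list(panel_specs("MR"))
print(len(specs), specs[0], specs[-1], sep="\n")
```

Each spec could then drive one `plot_internal_add` call, choosing `norm_internal_indel_ins` or `norm_internal_indel_del` according to `mut_type`.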

#### Plot for Fig. S3a - STRs  <a name="mutation_internal_indels_plots_figS3A"></a>

[Return to Table of Contents](#TOC)

In [None]:
repeats_figlist = ['A', 'C', 'AT', 'AC', 'AG', 'ACC', 'AGG', 'ATC', 'AGC', 'AAG', 'AAC', 'AAT']
counter = 0
inframe_mut_figS3a = make_subplots(rows=len(repeats_figlist), cols=8, shared_yaxes=True, shared_xaxes = True, vertical_spacing = 0.015, horizontal_spacing = 0.02, subplot_titles = ['Flank', 'Start/end', 'Within Motif', 'MM position', 'Flank', 'Start/end', 'Within Motif', 'MM position'])
for repeat in repeats_figlist:
    counter +=1
    plot_internal_add('STR', [6,25], [repeat, 'flank_pos', 'perfect'], counter, 1, norm_internal_indel_ins, showleg = True if counter ==1 else False, plot_name = inframe_mut_figS3a, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
    plot_internal_add('STR', [6,25], [repeat, 'motif_pos', 'perfect'], counter, 2, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS3a, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
    plot_internal_add('STR', [6,25], [repeat, 'motif_pos_genome_middle', 'perfect'], counter, 3, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS3a, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
    plot_internal_add('STR', [10,25], [repeat, 'MM_pos_genome', 'inframe'], counter, 4, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS3a, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
    plot_internal_add('STR', [6,25], [repeat, 'flank_pos', 'perfect'], counter, 5, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS3a, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
    plot_internal_add('STR', [6,25], [repeat, 'motif_pos', 'perfect'], counter, 6, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS3a, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
    plot_internal_add('STR', [6,25], [repeat, 'motif_pos_genome_middle', 'perfect'], counter, 7, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS3a, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
    plot_internal_add('STR', [10,25], [repeat, 'MM_pos_genome', 'inframe'], counter, 8, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS3a, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)

    inframe_mut_figS3a.update_yaxes(title = dict(text = repeat, font = dict(size = 18), standoff = 0), row = counter, col = 1)

inframe_mut_figS3a.update_yaxes(zeroline = False, range = [-1.1,3.1], type = 'log', dtick = 1)
inframe_mut_figS3a.add_shape(type='line', x0=0.5, x1=0.5, y0=-0.25, y1=1.05, xref='paper', yref='paper', line=dict(color='rgb(150,150,150)', width = 3))
inframe_mut_figS3a.update_layout(title = dict(text = 'Insertions                                                               Deletions', x = 0.225, font = dict(size = 18)))

inframe_mut_figS3a.update_xaxes(domain=[0, 0.11], col = 1)
inframe_mut_figS3a.update_xaxes(domain=[0.13, 0.24], col = 2)
inframe_mut_figS3a.update_xaxes(domain=[0.26, 0.37], col = 3)
inframe_mut_figS3a.update_xaxes(domain=[0.39, 0.48], col = 4)
inframe_mut_figS3a.update_xaxes(domain=[0.52, 0.63], col = 5)
inframe_mut_figS3a.update_xaxes(domain=[0.65, 0.76], col = 6)
inframe_mut_figS3a.update_xaxes(domain=[0.78, 0.89], col = 7)
inframe_mut_figS3a.update_xaxes(domain=[0.91, 1], col = 8)

inframe_mut_figS3a.update_xaxes(title = dict(text = 'motif length', font = dict(size = 14), standoff = 0), row = counter, col = 1)
inframe_mut_figS3a.update_layout(width = 1000, height = 100*len(repeats_figlist), margin = dict(l = 45, r = 25, b = 0, t = 50), legend=dict(y = -0.02, x = 0.175, orientation='h'))
inframe_mut_figS3a.show()

In [None]:
inframe_mut_figS3a.write_image('./plots/revision_internal_indel_fig_S3a.png', format='png', scale = 10, engine = 'orca')
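
The y-axis `range` in the cell above is set together with `type = 'log'`. For a Plotly log axis the range is given as log10 of the data values, so `[-1.1, 3.1]` does not mean frequencies from -1.1 to 3.1. A short sketch of the conversion (the function name is illustrative only):

```python
def log_range_to_data(log_range):
    """Convert a log10 axis range into the corresponding data-value limits."""
    return tuple(10 ** bound for bound in log_range)


lower, upper = log_range_to_data([-1.1, 3.1])
print(f"{lower:.3f} to {upper:.0f}")
```

With `dtick = 1` on a log axis, ticks fall one decade apart within those limits.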

#### Plots for Fig. S5a, S5b - G4s  <a name="mutation_internal_indels_plots_figS5"></a>

[Return to Table of Contents](#TOC)

In [None]:
# QC effect: normalized frequency at the -1.0607 VQSLOD cutoff minus the unfiltered (-inf) value
G4_QCeffect_indel = G4_norm_summary_indel_tri[-1.0607] - G4_norm_summary_indel_tri[-np.inf]

G4_newlistofpositions = ["run 5'", "run_positions_middle", "run 3'", "loop 5'", "loop_positions_middle",  "loop 3'", "loop 1nt"]
G4_newlistofnames = ["5'", 'mid',"3'", "5'", 'mid', "3'", '1nt']

G4_QCeffect_indel_up = dict()
G4_QCeffect_indel_down = dict()

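# Left edge of the jitter window for each position group; group i is spread uniformly over [i - 0.35, i + 0.40)
count = -0.35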
for position in G4_newlistofpositions:
    G4_QCeffect_indel_up[position] = pd.DataFrame(G4_QCeffect_indel[position].loc[G4_QCeffect_indel[position] >0].copy())
    G4_QCeffect_indel_down[position] = pd.DataFrame(G4_QCeffect_indel[position].loc[G4_QCeffect_indel[position] <0].copy())

    G4_QCeffect_indel_up[position][-np.inf] = G4_norm_summary_indel_tri[-np.inf][position].reindex(G4_QCeffect_indel_up[position].index).copy()
    G4_QCeffect_indel_down[position][-np.inf] = G4_norm_summary_indel_tri[-np.inf][position].reindex(G4_QCeffect_indel_down[position].index).copy()
    G4_QCeffect_indel_up[position][-1.0607] = G4_norm_summary_indel_tri[-1.0607][position].reindex(G4_QCeffect_indel_up[position].index).copy()
    G4_QCeffect_indel_down[position][-1.0607] = G4_norm_summary_indel_tri[-1.0607][position].reindex(G4_QCeffect_indel_down[position].index).copy()

    G4_QCeffect_indel_up[position]['random'] = count + (0.75* np.random.random(len(G4_QCeffect_indel_up[position])))
    G4_QCeffect_indel_down[position]['random'] = count + (0.75* np.random.random(len(G4_QCeffect_indel_down[position])))

    G4_QCeffect_indel_up[position]['name'] = [str(name).replace('_', '>') if G4_QCeffect_indel_up[position][position][name] >1.5 else '' for name in G4_QCeffect_indel_up[position].index]
    G4_QCeffect_indel_down[position]['name'] = [str(name).replace('_', '>') if G4_QCeffect_indel_down[position][position][name] <-1.5 else '' for name in G4_QCeffect_indel_down[position].index]

    count +=1

# Set x-axis location (overriding the random jitter) for particular labeled trinucleotides
G4_QCeffect_indel_down["loop 5'"].loc['GTC_G', 'random'] = 2.65
G4_QCeffect_indel_down["loop 5'"].loc['GTA_G', 'random'] = 3
G4_QCeffect_indel_down["loop 5'"].loc['GTT_G', 'random'] = 3.35

G4_QCeffect_indel_down["loop 3'"].loc['TTG_G', 'random'] = 4.65
G4_QCeffect_indel_down["loop 3'"].loc['TAG_T', 'random'] = 5
G4_QCeffect_indel_down["loop 3'"].loc['CAG_T', 'random'] = 5.35

G4_QCeffect_indel_down['loop 1nt'].loc['GTG_G', 'random'] = 5.8
G4_QCeffect_indel_down['loop 1nt'].loc['GAG_T', 'random'] = 6.15

In [None]:
G4_QCeffect_ins = G4_norm_summary_ins_tri[-1.0607] - G4_norm_summary_ins_tri[-np.inf]

G4_newlistofpositions = ["run 5'", "run_positions_middle", "run 3'", "loop 5'", "loop_positions_middle",  "loop 3'", "loop 1nt"]
G4_newlistofnames = ["5'", 'mid',"3'", "5'", 'mid', "3'", '1nt']

G4_QCeffect_ins_up = dict()
G4_QCeffect_ins_down = dict()

count = -0.35
for position in G4_newlistofpositions:
    G4_QCeffect_ins_up[position] = pd.DataFrame(G4_QCeffect_ins[position].loc[G4_QCeffect_ins[position] >0].copy())
    G4_QCeffect_ins_down[position] = pd.DataFrame(G4_QCeffect_ins[position].loc[G4_QCeffect_ins[position] <0].copy())

    G4_QCeffect_ins_up[position][-np.inf] = G4_norm_summary_ins_tri[-np.inf][position].reindex(G4_QCeffect_ins_up[position].index).copy()
    G4_QCeffect_ins_down[position][-np.inf] = G4_norm_summary_ins_tri[-np.inf][position].reindex(G4_QCeffect_ins_down[position].index).copy()
    G4_QCeffect_ins_up[position][-1.0607] = G4_norm_summary_ins_tri[-1.0607][position].reindex(G4_QCeffect_ins_up[position].index).copy()
    G4_QCeffect_ins_down[position][-1.0607] = G4_norm_summary_ins_tri[-1.0607][position].reindex(G4_QCeffect_ins_down[position].index).copy()

    G4_QCeffect_ins_up[position]['random'] = count + (0.75* np.random.random(len(G4_QCeffect_ins_up[position])))
    G4_QCeffect_ins_down[position]['random'] = count + (0.75* np.random.random(len(G4_QCeffect_ins_down[position])))

    G4_QCeffect_ins_up[position]['name'] = [str(name).replace('_', '>') if G4_QCeffect_ins_up[position][position][name] >10 else '' for name in G4_QCeffect_ins_up[position].index]
    G4_QCeffect_ins_down[position]['name'] = [str(name).replace('_', '>') if G4_QCeffect_ins_down[position][position][name] <-10 else '' for name in G4_QCeffect_ins_down[position].index]

    count +=1

# Set x-axis location (overriding the random jitter) for particular labeled trinucleotides
G4_QCeffect_ins_down["loop 5'"].loc['GTA_ins', 'random'] = 2.775
G4_QCeffect_ins_down["loop 5'"].loc['GCA_ins', 'random'] = 3.45

In [None]:
G4_QCeffect_del = G4_norm_summary_del_tri[-1.0607] - G4_norm_summary_del_tri[-np.inf]

G4_newlistofpositions = ["run 5'", "run_positions_middle", "run 3'", "loop 5'", "loop_positions_middle",  "loop 3'", "loop 1nt"]
G4_newlistofnames = ["5'", 'mid',"3'", "5'", 'mid', "3'", '1nt']

G4_QCeffect_del_up = dict()
G4_QCeffect_del_down = dict()

count = -0.35
for position in G4_newlistofpositions:
    G4_QCeffect_del_up[position] = pd.DataFrame(G4_QCeffect_del[position].loc[G4_QCeffect_del[position] >0].copy())
    G4_QCeffect_del_down[position] = pd.DataFrame(G4_QCeffect_del[position].loc[G4_QCeffect_del[position] <0].copy())

    G4_QCeffect_del_up[position][-np.inf] = G4_norm_summary_del_tri[-np.inf][position].reindex(G4_QCeffect_del_up[position].index).copy()
    G4_QCeffect_del_down[position][-np.inf] = G4_norm_summary_del_tri[-np.inf][position].reindex(G4_QCeffect_del_down[position].index).copy()
    G4_QCeffect_del_up[position][-1.0607] = G4_norm_summary_del_tri[-1.0607][position].reindex(G4_QCeffect_del_up[position].index).copy()
    G4_QCeffect_del_down[position][-1.0607] = G4_norm_summary_del_tri[-1.0607][position].reindex(G4_QCeffect_del_down[position].index).copy()

    G4_QCeffect_del_up[position]['random'] = count + (0.75* np.random.random(len(G4_QCeffect_del_up[position])))
    G4_QCeffect_del_down[position]['random'] = count + (0.75* np.random.random(len(G4_QCeffect_del_down[position])))

    G4_QCeffect_del_up[position]['name'] = [str(name).replace('_', '>') if G4_QCeffect_del_up[position][position][name] >1.5 else '' for name in G4_QCeffect_del_up[position].index]
    G4_QCeffect_del_down[position]['name'] = [str(name).replace('_', '>') if G4_QCeffect_del_down[position][position][name] <-1.5 else '' for name in G4_QCeffect_del_down[position].index]

    count +=1

# Set x-axis location (overriding the random jitter) for particular labeled trinucleotides
#G4_QCeffect_del_down["loop 5'"].loc['GTC_G', 'random'] = 2.65
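
The three cells above repeat the same steps for combined indels, insertions, and deletions: take the difference between the filtered (`-1.0607`) and unfiltered (`-np.inf`) values, split mutations by the sign of that difference, and label only those whose change exceeds a threshold. Below is a minimal sketch of that logic, with plain dictionaries standing in for the pandas objects; the function name `split_qc_effect` and the toy values are invented for illustration.

```python
def split_qc_effect(filtered, unfiltered, label_threshold):
    """Split mutations into those that increase ('up') or decrease ('down')
    after QC filtering, labelling only large changes."""
    up, down = {}, {}
    for name in filtered:
        effect = filtered[name] - unfiltered[name]
        label = name.replace("_", ">")
        if effect > 0:
            up[name] = {"effect": effect,
                        "name": label if effect > label_threshold else ""}
        elif effect < 0:
            down[name] = {"effect": effect,
                          "name": label if effect < -label_threshold else ""}
        # An effect of exactly zero goes into neither group
    return up, down


# Toy values, not taken from the dataset
filtered = {"GTC_G": 1.0, "GTA_G": 4.0, "TTG_G": 6.5}
unfiltered = {"GTC_G": 3.0, "GTA_G": 4.0, "TTG_G": 6.0}
up, down = split_qc_effect(filtered, unfiltered, label_threshold=1.5)
print(up)
print(down)
```

Wrapping the shared steps in one function like this is one way the three near-identical cells could be reduced to three calls that differ only in the input table and label threshold.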

In [None]:
G4_ins_arrow_fig = go.Figure()

# One bar series per QC filter; bars are drawn from a baseline of 1 to the normalized frequency
for QC_filter in vqslod_list_indel:
    G4_ins_arrow_fig.add_trace(go.Bar(x = list(range(7)), y = G4_norm_summary_ins[0][QC_filter] -1, base = 1, marker = dict(color = QC_colors_internal_indel[0][QC_filter]), showlegend = True, name = QC_colors_internal_indel['name'][QC_filter], error_y=dict(type='data', symmetric=False, array = pd.Series(G4_norm_summary_ins[2][QC_filter] - G4_norm_summary_ins[0][QC_filter]), arrayminus = pd.Series(G4_norm_summary_ins[0][QC_filter] - G4_norm_summary_ins[1][QC_filter]), color=QC_colors_internal_indel[1][QC_filter], thickness=1.5, width=3)))

list_of_annotations = []
for pos in G4_QCeffect_ins_down:
    current_group = G4_QCeffect_ins_down[pos]
    if len(current_group) > 0:
        current_lines = [dict(x=current_group['random'][mut], ax=current_group['random'][mut], y=current_group[-1.0607][mut], ay=current_group[-np.inf][mut], xref='x1', yref='y1', axref='x1', ayref='y1', text = current_group['name'][mut], showarrow=True, arrowhead=2, arrowsize=1, arrowwidth=1.5, arrowcolor='rgba(255,0,0,0.5)', font = dict(color = 'rgb(0,0,0)')) for mut in current_group.index]
        list_of_annotations = list_of_annotations + current_lines
    current_group = G4_QCeffect_ins_up[pos]
    if len(current_group) > 0:
        current_lines = [dict(x=current_group['random'][mut], ax=current_group['random'][mut], y=current_group[-1.0607][mut], ay=current_group[-np.inf][mut], xref='x1', yref='y1', axref='x1', ayref='y1', text = current_group['name'][mut], showarrow=True, arrowhead=2, arrowsize=1, arrowwidth=1.5, arrowcolor='rgba(0,0,0,0.5)', font = dict(color = 'rgb(0,0,0)')) for mut in current_group.index]
        list_of_annotations = list_of_annotations + current_lines

list_of_annotations = list_of_annotations + [dict(text = 'G4 stem                                     spacer', font = dict(size = 18), x = 0.16, y = 0, showarrow = False, xref = 'paper', yref = 'paper')]
G4_ins_arrow_fig.add_shape(type='line', x0=0.4275, x1=0.4275, y0=-0.1, y1=1, xref='paper', yref='paper', line=dict(color='rgb(150,150,150)', width = 3))

G4_ins_arrow_fig.update_xaxes(tickmode = 'array', tickvals = list(range(7)), ticktext = G4_newlistofnames)
G4_ins_arrow_fig.update_yaxes(zeroline = False, range = [-5, 70], tickvals = [1,10,20,30,40,50,60], title = dict(text = 'Relative insertion frequency', standoff = 0, font = dict(size = 18)))
G4_ins_arrow_fig.update_layout(annotations= list_of_annotations)

G4_ins_arrow_fig.update_layout(height = 520, width = 700, margin = dict(l = 40, r = 10, b = 10, t = 10), legend=dict(y = -0.04, x = 0.2, orientation='h'))
G4_ins_arrow_fig.update_layout(barmode = 'overlay')

In [None]:
G4_ins_arrow_fig.write_image('./plots/revision_internal_mutation_G4_ins_fig_S5a.png', format='png', scale = 10, engine = 'orca')

In [None]:
G4_del_arrow_fig = go.Figure()

# One bar series per QC filter; bars are drawn from a baseline of 1 to the normalized frequency
for QC_filter in vqslod_list_indel:
    G4_del_arrow_fig.add_trace(go.Bar(x = list(range(7)), y = G4_norm_summary_del[0][QC_filter] -1, base = 1, marker = dict(color = QC_colors_internal_indel[0][QC_filter]), showlegend = True, name = QC_colors_internal_indel['name'][QC_filter], error_y=dict(type='data', symmetric=False, array = pd.Series(G4_norm_summary_del[2][QC_filter] - G4_norm_summary_del[0][QC_filter]), arrayminus = pd.Series(G4_norm_summary_del[0][QC_filter] - G4_norm_summary_del[1][QC_filter]), color=QC_colors_internal_indel[1][QC_filter], thickness=1.5, width=3)))

list_of_annotations = []
for pos in G4_QCeffect_del_down:
    current_group = G4_QCeffect_del_down[pos]
    if len(current_group) > 0:
        current_lines = [dict(x=current_group['random'][mut], ax=current_group['random'][mut], y=current_group[-1.0607][mut], ay=current_group[-np.inf][mut], xref='x1', yref='y1', axref='x1', ayref='y1', text = current_group['name'][mut], showarrow=True, arrowhead=2, arrowsize=1, arrowwidth=1.5, arrowcolor='rgba(255,0,0,0.5)', font = dict(color = 'rgb(0,0,0)')) for mut in current_group.index]
        list_of_annotations = list_of_annotations + current_lines
    current_group = G4_QCeffect_del_up[pos]
    if len(current_group) > 0:
        current_lines = [dict(x=current_group['random'][mut], ax=current_group['random'][mut], y=current_group[-1.0607][mut], ay=current_group[-np.inf][mut], xref='x1', yref='y1', axref='x1', ayref='y1', text = current_group['name'][mut], showarrow=True, arrowhead=2, arrowsize=1, arrowwidth=1.5, arrowcolor='rgba(0,0,0,0.5)', font = dict(color = 'rgb(0,0,0)')) for mut in current_group.index]
        list_of_annotations = list_of_annotations + current_lines

list_of_annotations = list_of_annotations + [dict(text = 'G4 stem                                     spacer', font = dict(size = 18), x = 0.16, y = 0, showarrow = False, xref = 'paper', yref = 'paper')]
G4_del_arrow_fig.add_shape(type='line', x0=0.4275, x1=0.4275, y0=-0.1, y1=1, xref='paper', yref='paper', line=dict(color='rgb(150,150,150)', width = 3))

G4_del_arrow_fig.update_xaxes(tickmode = 'array', tickvals = list(range(7)), ticktext = G4_newlistofnames)
G4_del_arrow_fig.update_yaxes(range = [0.05, 10.75], dtick = 1, title = dict(text = 'Relative deletion frequency', standoff = 0, font = dict(size = 18)))
G4_del_arrow_fig.update_layout(annotations= list_of_annotations)

G4_del_arrow_fig.update_layout(height = 520, width = 700, margin = dict(l = 40, r = 10, b = 10, t = 10), legend=dict(y = -0.04, x = 0.2, orientation='h'))
G4_del_arrow_fig.update_layout(barmode = 'overlay')

In [None]:
G4_del_arrow_fig.write_image('./plots/revision_internal_mutation_G4_del_fig_S5b.png', format='png', scale = 10, engine = 'orca')
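
In the two arrow figures above, each annotation draws an arrow whose tail (`ax`, `ay`) sits at the unfiltered value and whose head (`x`, `y`) sits at the filtered value; both ends are in data coordinates because `axref` and `ayref` name data axes. The sketch below builds one such annotation as a plain dict. The helper name `qc_arrow` and the example numbers are hypothetical.

```python
def qc_arrow(x_pos, unfiltered, filtered, label="", decreased=True):
    """Build a Plotly-style annotation dict for a vertical arrow running
    from the unfiltered value (tail) to the filtered value (head)."""
    # Red for frequencies that drop after filtering, black for those that rise
    color = "rgba(255,0,0,0.5)" if decreased else "rgba(0,0,0,0.5)"
    return dict(
        x=x_pos, y=filtered,        # arrow head
        ax=x_pos, ay=unfiltered,    # arrow tail
        xref="x1", yref="y1", axref="x1", ayref="y1",
        text=label, showarrow=True,
        arrowhead=2, arrowsize=1, arrowwidth=1.5, arrowcolor=color,
        font=dict(color="rgb(0,0,0)"),
    )


arrow = qc_arrow(3.0, unfiltered=42.0, filtered=12.5, label="GTA>ins")
print(arrow["ay"], "->", arrow["y"])
```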

### Combined indel/SNV plots

#### Plot for Fig. 3a - STRs <a name="mutation_internal_combined_plots_fig3A"></a>

[Return to Table of Contents](#TOC)

In [None]:
repeats_figlist = ['A', 'AT', 'AC', 'AG']
counter = 0
inframe_mut_fig3a = make_subplots(rows=len(repeats_figlist), cols=6, shared_yaxes=False, shared_xaxes = True, vertical_spacing = 0.0375, horizontal_spacing = 0.025,  column_widths=[0.188, 0.188, 0.01, 0.23, 0.188, 0.188], subplot_titles = ['Insertions', 'Deletions', '', 'SNVs in motif', 'Perfecting SNVs', 'Non-perfecting'])
for repeat in repeats_figlist:
    counter +=1
    plot_internal_add('STR', [6,25], [repeat, 'motif_pos_genome_middle', 'perfect'], counter, 1, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_fig3a, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
    plot_internal_add('STR', [6,25], [repeat, 'motif_pos_genome_middle', 'perfect'], counter, 2, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_fig3a, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)

    plot_internal_add('STR', [6,30], ['non_pred', repeat, 'motif_pos_genome_middle', 'perfect'], counter, 4, norm_internal_all, showleg = False, plot_name = inframe_mut_fig3a, pred_factor = 1)
    plot_internal_add('STR', [10,30], ['pred', repeat, 'MM_pos_genome', 'inframe'], counter, 5, norm_internal_all, showleg = True if counter == 1 else False, plot_name = inframe_mut_fig3a, pred_factor = 3)
    plot_internal_add('STR', [10,30], ['against_pred', repeat, 'MM_pos_genome', 'inframe'], counter, 6, norm_internal_all, showleg = False, plot_name = inframe_mut_fig3a, pred_factor = 1.5)
    
    inframe_mut_fig3a.update_yaxes(title = dict(text = repeat, font = dict(size = 18), standoff = 0), row = counter, col = 1)

inframe_mut_fig3a.add_shape(type='line', x0=0.38, x1=0.38, y0=-0.05, y1=1.05, xref='paper', yref='paper', line=dict(color='rgb(150,150,150)', width = 3))

inframe_mut_fig3a.update_yaxes(zeroline = False, range = [-1.1,3.1], type = 'log', dtick = 1, col = 1)
inframe_mut_fig3a.update_yaxes(zeroline = False, showticklabels = False, range = [-1.1,3.1], type = 'log', dtick = 1, col = 2)
inframe_mut_fig3a.update_yaxes(zeroline = False, range = [-0.5,14.99], dtick = 4, col = 4)
inframe_mut_fig3a.update_yaxes(zeroline = False, showticklabels = False, range = [-0.5,14.99], dtick = 4, col = 5)
inframe_mut_fig3a.update_yaxes(zeroline = False, showticklabels = False, range = [-0.5,14.99], dtick = 4, col = 6)

inframe_mut_fig3a.update_xaxes(dtick = 5)
inframe_mut_fig3a.update_xaxes(title = dict(text = 'motif length', font = dict(size = 14), standoff = 0), row = counter, col = 1)
inframe_mut_fig3a.update_layout(width = 800, height = 120*len(repeats_figlist), margin = dict(l = 45, r = 25, b = 35, t = 50), legend=dict(y = -0.04, x = 0.175, orientation='h'))
inframe_mut_fig3a.show()

In [None]:
inframe_mut_fig3a.write_image('./plots/revision_ACcor_internal_STR_indel_snv_combined_fig_3a.png', format='png', scale = 10, engine = 'orca')

#### Plot for Fig. S6c - ZDNA <a name="mutation_internal_combined_plots_figS6C"></a>

[Return to Table of Contents](#TOC)

In [None]:
ZDNA_mut_fig = make_subplots(rows=1, cols=6, shared_yaxes=True, shared_xaxes = True, vertical_spacing = 0.020, horizontal_spacing = 0.02, subplot_titles = ['Flank', 'Within Motif', 'Flank', 'Within Motif', 'Flank', 'Within Motif'])
plot_internal_add('ZDNA', [10,18], ['non_pred', 'flank_pos'], 1, 1, norm_internal_all, showleg = False, plot_name = ZDNA_mut_fig, pred_factor = 1)
plot_internal_add('ZDNA', [10,20], ['non_pred', 'motif_pos_genome_middle'], 1, 2, norm_internal_all, showleg = True, plot_name = ZDNA_mut_fig, pred_factor = 1)

plot_internal_add('ZDNA', [10,20], ['flank_pos'], 1, 3, norm_internal_indel_ins, showleg = False, plot_name = ZDNA_mut_fig, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('ZDNA', [10,20], ['motif_pos_genome_middle'], 1, 4, norm_internal_indel_ins, showleg = False, plot_name = ZDNA_mut_fig, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('ZDNA', [10,20], ['flank_pos'], 1, 5, norm_internal_indel_del, showleg = False, plot_name = ZDNA_mut_fig, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)
plot_internal_add('ZDNA', [10,20], ['motif_pos_genome_middle'], 1, 6, norm_internal_indel_del, showleg = False, plot_name = ZDNA_mut_fig, pred_factor = 1, QC_colors_current=QC_colors_internal_indel)

ZDNA_mut_fig.add_shape(type='line', x0=0.33, x1=0.33, y0=-0.25, y1=1.25, xref='paper', yref='paper', line=dict(color='rgb(150,150,150)', width = 3))
ZDNA_mut_fig.add_shape(type='line', x0=0.67, x1=0.67, y0=-0.25, y1=1.25, xref='paper', yref='paper', line=dict(color='rgb(150,150,150)', width = 3))
ZDNA_mut_fig.update_layout(title = dict(text = 'SNVs                            Insertions                          Deletions', x = 0.175, font = dict(size = 18)))
ZDNA_mut_fig.update_yaxes(title = dict(text = 'Z-DNA'), row = 1, col = 1)
ZDNA_mut_fig.update_yaxes(zeroline = False, range = [-1.1,1.7], type = 'log', dtick = 1)
ZDNA_mut_fig.update_xaxes(title = dict(text = 'motif length', font = dict(size = 14), standoff = 0), row = 1, col = 1)

ZDNA_mut_fig.update_layout(width = 800, height = 150, margin = dict(l = 45, r = 5, b = 0, t = 50), legend=dict(y = -0.4, x = 0.2, orientation='h'))
ZDNA_mut_fig.show()

In [None]:
ZDNA_mut_fig.write_image('./plots/revision_ACcor_ZDNA_mutfreq_figS6c.png', format='png', scale = 10, engine = 'orca')

## STR insertion fidelity <a name="mutation_internal_STR_insertion_fidelity"></a>

#### Calculation of error frequency <a name="mutation_internal_STR_insertion_fidelity_calculation"></a>

[Return to Table of Contents](#TOC)
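
The cells below compare each inserted sequence against a template built by repeating the local motif frame, allowing a partial final unit, and record the positions where the two differ. A minimal sketch of that comparison on a made-up insertion follows; the helper names `build_template` and `mismatch_positions` are hypothetical and do not appear elsewhere in this notebook.

```python
def build_template(frame: str, ins_len: int) -> str:
    """Repeat `frame` to cover `ins_len` bases, allowing a partial final unit."""
    full_units = ins_len // len(frame)           # whole repeat units that fit
    template = frame * full_units
    return template + frame[:ins_len - len(template)]

def mismatch_positions(template: str, observed: str) -> list:
    """0-based positions where two equal-length strings differ."""
    return [i for i, (a, b) in enumerate(zip(template, observed)) if a != b]

# Toy example (invented): a 5-nt insertion into an 'AC' repeat with one substitution
template = build_template('AC', 5)
errors = mismatch_positions(template, 'ACGCA')
print(template, errors)
```

In the notebook itself the template and the `ALT` allele are additionally padded with flanking reference bases so that each error can be assigned a trinucleotide context.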

In [None]:
# Collect all indels within STRs
STRs_indel = dict()
for chrom in range(chr_range,23):
    indel_in_STR = variants_indel_all[chrom].loc[variants_indel_all[chrom]['POS'].isin(pos_expand_all['STR'].loc[pos_expand_all['STR']['chrom'] == chrom]['pos'])].copy()
    STRs_current = all_STRs_unique.loc[all_STRs_unique['chrom'] == chrom][['start', 'end', 'chrom', 'length', 'repeat', 'repeat_frame_L', 'Strand', 'status', 'Sequence', 'MM_pos']]
    STRs_indel[chrom] = STRs_current.iloc[np.searchsorted(STRs_current['end'], indel_in_STR['POS'])].copy()
    STRs_indel[chrom][['POS', 'REF', 'ALT', 'AS_VQSLOD', 'indel']] = indel_in_STR[['POS', 'REF', 'ALT', 'AS_VQSLOD', 'indel']].values
STRs_indel = pd.concat(STRs_indel)

In [None]:
# Separate insertions and deletions
STRs_ins = STRs_indel.loc[STRs_indel['ALT'].str.len() > 1].copy()
STRs_del = STRs_indel.loc[STRs_indel['REF'].str.len() > 1].copy()
# Adjust for included reference base
STRs_del['del_len'] = STRs_del['REF'].str.len() -1
STRs_ins['ins_len'] = STRs_ins['ALT'].str.len() -1
STRs_ins['ins_seq'] = STRs_ins['ALT'].str[1:]

In [None]:
# Exclude insertions at flank
STRs_ins = STRs_ins.loc[(STRs_ins['POS'] >= STRs_ins['start']) & (STRs_ins['POS'] <= STRs_ins['end'])].copy()

In [None]:
# Find motif frame for the inserted sequence
STRs_ins['repeat_frame_ins'] = [seq[pos-start:pos-start+rep_len] if (pos - start) < (total_len/2) else seq[pos-start-rep_len:pos-start] for seq, pos, start, rep_len, total_len in zip(STRs_ins['Sequence'], STRs_ins['POS'], STRs_ins['start'], STRs_ins['repeat'].str.len(), STRs_ins['length'])]
# Perfect insertion template (repeat unit multiplied by the integer quotient of insertion length / unit length)
STRs_ins['repeat_unit_altlen'] = STRs_ins['repeat_frame_ins'] * ((STRs_ins['ins_len']) / STRs_ins['repeat'].str.len()).astype(int)

In [None]:
# Insertion template allowing partial repeat units
STRs_ins['ins_template'] = [template + frame[:ins_len-len(template)] for template, frame, ins_len in zip(STRs_ins['repeat_unit_altlen'], STRs_ins['repeat_frame_ins'], STRs_ins['ins_len'])]
# Template with +/- 1nt from reference sequence
STRs_ins['ins_template+'] = [reference_genome[chrom][pos-1] + template + reference_genome[chrom][pos] for chrom, pos, template in zip(STRs_ins['chrom'], STRs_ins['POS'], STRs_ins['ins_template'])]
STRs_ins['ALT+'] = [alt + reference_genome[chrom][pos] for chrom, pos, alt in zip(STRs_ins['chrom'], STRs_ins['POS'], STRs_ins['ALT'])]
# Positions where insertion differs from template
STRs_ins['dup_mut_pos'] = [np.where([a!=b for a,b in zip(seq1, seq2)]) for seq1, seq2 in zip(STRs_ins['ins_template+'], STRs_ins['ALT+'])]
STRs_ins['dup_mut_pos'] = [pos[0] for pos in STRs_ins['dup_mut_pos']]

STRs_ins['#_errors'] = STRs_ins['dup_mut_pos'].str.len()

In [None]:
# Find perfect expansions (insertion sequence is a multiple of repeat unit)
STRs_ins['perfect_expansion'] = STRs_ins['repeat_unit_altlen'] == STRs_ins['ins_seq']
STRs_ins_perfect = STRs_ins.loc[(STRs_ins['status'] == 'perfect') & (STRs_ins['perfect_expansion'] == True) & (STRs_ins['repeat_frame_ins'].str.len() > 0)].copy()
# Find out of register expansions (insertion length is not a multiple of repeat unit length)
STRs_ins_oor = STRs_ins.loc[(STRs_ins['status'] == 'perfect') & (STRs_ins['perfect_expansion'] == False) & (STRs_ins['#_errors'] == 0)].copy()
# Insertions have errors compared to repeat motif of same length
STRs_ins_imperfect = STRs_ins.loc[(STRs_ins['status'] == 'perfect') & (STRs_ins['perfect_expansion'] == False) & (STRs_ins['#_errors'] > 0)].copy()
# Insertions have fewer than 3 errors, and the insertion is longer than the number of errors + 1
STRs_ins_imperfect_related = STRs_ins_imperfect.loc[(STRs_ins_imperfect['#_errors'] < 3) & (STRs_ins_imperfect['ins_len'] > STRs_ins_imperfect['#_errors'] +1)].copy()

In [None]:
# Insertions with up to 2 SNV errors
STRs_ins_imperfect_related['dup_mut_pos_0'] = [pos[0] for pos in STRs_ins_imperfect_related['dup_mut_pos']]
STRs_ins_imperfect_related['dup_mut_pos_1'] = [pos[1] if len(pos) == 2 else np.nan for pos in STRs_ins_imperfect_related['dup_mut_pos']]

STRs_ins_imperfect_related['error1_tri'] = [seq[pos-1:pos+2] for seq, pos in zip(STRs_ins_imperfect_related['ins_template+'], STRs_ins_imperfect_related['dup_mut_pos_0'])]
STRs_ins_imperfect_related['error1_mut'] = [seq[pos] for seq, pos in zip(STRs_ins_imperfect_related['ALT+'], STRs_ins_imperfect_related['dup_mut_pos_0'])]

STRs_ins_imperfect_related['error2_tri'] = [np.nan if np.isnan(pos) == True else seq[int(pos)-1:int(pos)+2] for seq, pos in zip(STRs_ins_imperfect_related['ins_template+'], STRs_ins_imperfect_related['dup_mut_pos_1'])]
STRs_ins_imperfect_related['error2_mut'] = [np.nan if np.isnan(pos) == True else seq[int(pos)] for seq, pos in zip(STRs_ins_imperfect_related['ALT+'], STRs_ins_imperfect_related['dup_mut_pos_1'])]

STR_ins_muterrors_1 = STRs_ins_imperfect_related.groupby(['repeat', 'length', 'error1_tri', 'error1_mut']).count()['ALT'].unstack().fillna(0).astype(int)
STR_ins_muterrors_2 = STRs_ins_imperfect_related.groupby(['repeat', 'length', 'error2_tri', 'error2_mut']).count()['ALT'].unstack().fillna(0).astype(int)
STR_ins_muterrors_1.index.names = ['repeat', 'length', 'tri']; STR_ins_muterrors_2.index.names = ['repeat', 'length', 'tri']
STR_ins_muterrors_all = STR_ins_muterrors_1.add(STR_ins_muterrors_2, fill_value = 0).astype(int)

In [None]:
# Reorganize data
STR_ins_muterrors_repeat = dict()
repeats_figlist = ['A', 'C', 'AT', 'AC', 'AG', 'ACC', 'AGG', 'ATC', 'AGC', 'AAG', 'AAC', 'AAT', 'AAAT']
for repeat in repeats_figlist:
    STR_ins_muterrors_repeat[repeat] = dict()
    current_repeat = STR_ins_muterrors_all.loc[repeat]
    for length in set(current_repeat.index.get_level_values(0)):
        current_length = current_repeat.loc[length].stack()
        current_length.index = [tri+'_'+alt for tri, alt in zip(current_length.index.get_level_values(0), current_length.index.get_level_values(1))]
        STR_ins_muterrors_repeat[repeat][length] = triplet_combine_RC(current_length, mut_input=True)
    STR_ins_muterrors_repeat[repeat] = pd.concat(STR_ins_muterrors_repeat[repeat])
STR_ins_muterrors_repeat = pd.concat(STR_ins_muterrors_repeat)

In [None]:
# Count all trinucleotides in insertions
def count_tri_from_groupby(df):
    return triplet_combine_RC(pd.Series(flatten([re.findall('...', seq) + re.findall('...', seq[1:]) + re.findall('...', seq[2:]) for seq in df['ins_template+']])).value_counts(), mut_output=True)

STRs_ins_tri_total_repeat = STRs_ins.groupby(['repeat', 'length']).apply(count_tri_from_groupby)
STRs_ins_tri_total_repeat = STRs_ins_tri_total_repeat.transpose().unstack().fillna(0).astype(int)

In [None]:
# Calculate error frequency, weighted to trinucleotide count
STRs_ins_muterrors_freq = STR_ins_muterrors_repeat.div(STRs_ins_tri_total_repeat.reindex(STR_ins_muterrors_repeat.index))
STRs_ins_muterrors_freq_weight = dict()
for repeat in repeats_figlist:
    STRs_ins_muterrors_freq_weight[repeat] = pd.Series(np.nan)
    for length in set(STRs_ins_muterrors_freq.loc[repeat].index.get_level_values(0)):
        STRs_ins_muterrors_freq_weight[repeat][length] = np.ma.average(np.ma.MaskedArray(STRs_ins_muterrors_freq.loc[repeat].loc[length], mask=np.isnan(STRs_ins_muterrors_freq.loc[repeat].loc[length])), weights=STRs_ins_tri_total_repeat.loc[repeat].loc[length])
STRs_ins_muterrors_freq_weight = pd.concat(STRs_ins_muterrors_freq_weight).dropna()
STRs_ins_muterrors_freq_weight = STRs_ins_muterrors_freq_weight.unstack().transpose()

In [None]:
# Calculate error frequency, normalized to denovo mutation frequency, weighted to trinucleotide count
STRs_ins_muterrors_norm = STRs_ins_muterrors_freq / (denovo_freq_RC.reindex(STRs_ins_muterrors_freq.index.get_level_values(2)) / denovo_n_genomes).values
STRs_ins_muterrors_norm_weight = dict()
for repeat in repeats_figlist:
    STRs_ins_muterrors_norm_weight[repeat] = pd.Series(np.nan)
    for length in set(STRs_ins_muterrors_norm.loc[repeat].index.get_level_values(0)):
        STRs_ins_muterrors_norm_weight[repeat][length] = np.ma.average(np.ma.MaskedArray(STRs_ins_muterrors_norm.loc[repeat].loc[length], mask=np.isnan(STRs_ins_muterrors_norm.loc[repeat].loc[length])), weights=STRs_ins_tri_total_repeat.loc[repeat].loc[length])
STRs_ins_muterrors_norm_weight = pd.concat(STRs_ins_muterrors_norm_weight).dropna()
STRs_ins_muterrors_norm_weight = STRs_ins_muterrors_norm_weight.unstack().transpose()

####  Plot for Fig. S3d - STR insertion fidelity <a name="mutation_internal_STR_insertion_fidelity_plots_figS3D"></a>

[Return to Table of Contents](#TOC)

In [None]:
repeats_figlist = ['A', 'C', 'AT', 'AC', 'AG', 'ACC', 'AGG', 'ATC', 'AGC', 'AAG', 'AAC', 'AAT', 'AAAT']
STR_fidelity_fig = make_subplots(rows = 1, cols = len(repeats_figlist), shared_yaxes = True, subplot_titles = repeats_figlist)
counter = 0
for repeat in repeats_figlist:
    counter +=1
    STR_fidelity_fig.add_trace(go.Bar(x = STRs_ins_muterrors_freq_weight[repeat].dropna().index, y = STRs_ins_muterrors_freq_weight[repeat].dropna()[:-1], name = repeat, showlegend = False), row = 1, col = counter)
STR_fidelity_fig.update_yaxes(type = 'log', range = [-3,-1])#, dtick = np.log10(2))
STR_fidelity_fig.add_shape(type='line', x0=0, x1=1, y0=1, y1=1, line=dict(color='Black', width = .5), xref = 'paper')
STR_fidelity_fig.update_xaxes(title = dict(text = 'motif length', font = dict(size = 14)), row = 1, col = 1)
STR_fidelity_fig.update_yaxes(title = dict(text = 'per base error density in expansions', font = dict(size = 14)), row = 1, col = 1)

STR_fidelity_fig.update_layout(width = 800, height = 350, margin = dict(l = 65, r = 5, b = 40, t = 20))

STR_fidelity_fig.show()

In [None]:
STR_fidelity_fig.write_image('./plots/revision_STR_fidelity_fig_S3d.png', format='png', scale = 10, engine = 'orca')

### Ratio of insertions to deletions <a name="mutation_internal_STR_insertion_deletion_ratio"></a>

- (insertion frequency * insertion length * percent of insertions that are expansions) / (deletion frequency * deletion length)
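
As a worked illustration of the ratio above, the sketch below evaluates it for a single motif length. All input numbers are invented for the example, and the function name `ins_del_ratio` is hypothetical. In the cell that computes `ins_del_bias`, the length and expansion terms are folded together by summing `ins_len` over insertions flagged as expansions.

```python
def ins_del_ratio(ins_freq, ins_len, expansion_fraction, del_freq, del_len):
    """Length-weighted ratio of expansions to contractions.

    ins_freq, del_freq  : relative insertion / deletion frequencies
    ins_len, del_len    : inserted / deleted length in nucleotides
    expansion_fraction  : share of insertions that qualify as expansions (0..1)
    """
    return (ins_freq * ins_len * expansion_fraction) / (del_freq * del_len)

# Invented numbers, for illustration only
ratio = ins_del_ratio(ins_freq=0.02, ins_len=2, expansion_fraction=0.5,
                      del_freq=0.01, del_len=4)
print(ratio)  # values below 1 indicate a bias toward contraction
```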

[Return to Table of Contents](#TOC)

In [None]:
# Locate deletions within repeat tract
STRs_del['seq+'] = [reference_genome[chrom][start-1: end+1] for chrom, start, end in zip(STRs_del['chrom'], STRs_del['start'], STRs_del['end'])]
STRs_del['deletion_pos'] = [[match.start(0) for match in re.finditer(deletion, seq, overlapped = True)] for deletion, seq in zip(STRs_del['REF'], STRs_del['seq+'])]
STRs_del['n_deletion_found'] = [len(location) for location in STRs_del['deletion_pos']]

# Count perfect and non-perfect expansions
STRs_ins['expansion'] = ((STRs_ins['perfect_expansion'] == True) | ((STRs_ins['#_errors'] < 3) & (STRs_ins['ins_len'] > STRs_ins['#_errors'] +1)))

In [None]:
# Calculate indel length and expansion percent
# Select the length column before summing so that non-numeric columns are not aggregated
del_length = STRs_del.groupby(['repeat', 'status', 'length'])['del_len'].sum()
ins_length = STRs_ins.groupby(['repeat', 'status', 'expansion', 'length'])['ins_len'].sum()

# Calculate ratio
repeats_figlist = ['A', 'C', 'AT', 'AC', 'AG', 'ACC', 'AGG', 'ATC', 'AGC', 'AAG', 'AAC', 'AAT', 'AAAT']
ins_del_bias = pd.DataFrame()
for repeat in repeats_figlist:
    ins_del_bias[repeat] = ((((norm_internal_indel_ins['STR'][0][-1.0607][repeat]['motif_pos_genome_middle']['perfect'] * normtorandom_indel['ins'][-1.0607]) * ins_length[repeat]['perfect'][True])) / (((norm_internal_indel_del['STR'][0][-1.0607][repeat]['motif_pos_genome_middle']['perfect'] * normtorandom_indel['del'][-1.0607]) * del_length[repeat]['perfect'])))

####  Plot for Fig. S3c - STR ins/del ratio <a name="mutation_internal_STR_insertion_deletion_ratio_plots_figS3C"></a>

[Return to Table of Contents](#TOC)

In [None]:
indel_bias_fig = make_subplots(rows = 1, cols = len(repeats_figlist), shared_yaxes = True, subplot_titles = repeats_figlist)
counter = 0
for repeat in repeats_figlist:
    counter +=1
    indel_bias_fig.add_trace(go.Bar(x = ins_del_bias[repeat].dropna().index, y = ins_del_bias[repeat].dropna()[:-1], name = repeat, showlegend = False), row = 1, col = counter)
indel_bias_fig.update_yaxes(type = 'log', dtick = np.log10(2), tickmode = 'array', tickvals = [1024, 256, 64, 16, 4, 1, 1/4, 1/16, 1/64, 1/256, 1/1024], ticktext = [1024, 256, 64, 16, 4, 1, '1/4', '1/16', '1/64', '1/256', '1/1024'])

indel_bias_fig.add_shape(type='line', x0=0, x1=1, y0=1, y1=1, line=dict(color='Black', width = .5), xref = 'paper')
indel_bias_fig.update_xaxes(title = dict(text = 'motif length', font = dict(size = 14)), row = 1, col = 1)
indel_bias_fig.update_yaxes(title = dict(text = 'expansions:contractions', font = dict(size = 14)), row = 1, col = 1)

indel_bias_fig.update_layout(width = 800, height = 350, margin = dict(l = 65, r = 5, b = 40, t = 20))

indel_bias_fig.show()

In [None]:
indel_bias_fig.write_image('./plots/revision_STR_indel_bias_fig_S3c.png', format='png', scale = 10, engine = 'orca')

### Absolute frequency of indels and SNVs <a name="mutation_internal_STR_indel_SNV_rate"></a>

- multiply relative mutation frequencies by independently measured genomic mutation rates for SNVs and indels
- testing hypothesis that SNVs could be explained by length-neutral combination of insertions and deletions
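
A minimal sketch of the scaling step described above, assuming the relative frequencies are simple multiplicative enrichments over the genome-wide rate. The two rate constants are the ones defined in the cell below (Sung et al., 2016); the relative frequencies and the name `absolute_rate` are invented for illustration.

```python
UG_SNV = 1.82e-9       # genomic SNV rate, as defined in this notebook
UG_INDEL = 1.3513e-8   # genomic indel rate, as defined in this notebook

def absolute_rate(relative_freq: float, genomic_rate: float) -> float:
    """Convert a relative (normalized) frequency to an absolute rate."""
    return relative_freq * genomic_rate

# Invented enrichments, for illustration only
snv_abs = absolute_rate(50.0, UG_SNV)
indel_abs = absolute_rate(10.0, UG_INDEL)
print(snv_abs, indel_abs)
```

Placing both mutation classes on the same absolute scale is what allows SNV and indel frequencies to be compared directly in the figure.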

[Return to Table of Contents](#TOC)

In [None]:
# Evolution of the Insertion-Deletion Mutation Rate Across the Tree of Life
# Sung, et al, 2016

ug_snv = 1.82 * (10**-9)        # ug for SNVs: 1.82 x 10^-9
ug_indel = 1.3513 * (10**-8)    # ug for indels: 1.3513 x 10^-8

####  Plot for Fig. S3e <a name="mutation_internal_STR_indel_SNV_rate_plots_figS3E"></a>

[Return to Table of Contents](#TOC)

In [None]:
inframe_mut_figS3e = make_subplots(rows=4, cols=3, shared_xaxes = True, shared_yaxes=True, vertical_spacing = 0.05, horizontal_spacing = 0.025, subplot_titles = ['Mutation at imperfection', 'Insertion in tract', 'Deletion at imperfection'])

counter = 0
for repeat in ['A', 'AT', 'AG', 'AC']:
    counter +=1
    plot_internal_add('STR', [10,30], ['pred', repeat, 'MM_pos_genome', 'inframe'], counter, 1, norm_internal_all, showleg = True if counter ==1 else False, plot_name = inframe_mut_figS3e, pred_factor = ug_snv)
    plot_internal_add('STR', [10,25], [repeat, 'motif_pos_genome_middle', 'inframe'], counter, 2, norm_internal_indel_ins, showleg = False, plot_name = inframe_mut_figS3e, pred_factor = ug_indel, QC_colors_current=QC_colors_internal_indel)
    plot_internal_add('STR', [10,25], [repeat, 'MM_pos_genome', 'inframe'], counter, 3, norm_internal_indel_del, showleg = False, plot_name = inframe_mut_figS3e, pred_factor = ug_indel, QC_colors_current=QC_colors_internal_indel)
    inframe_mut_figS3e.update_yaxes(zeroline = False, title = dict(text = repeat, font = dict(size = 18), standoff = 5), row = counter, col = 1)
    
inframe_mut_figS3e.update_xaxes(title = dict(text = 'motif length', font = dict(size = 14), standoff = 0), row = counter, col = 1)

inframe_mut_figS3e.update_yaxes(type = 'log', range = [-10.1, -5.9], exponentformat = 'e', dtick = 1, domain=[0.72, 1], row = 1)
inframe_mut_figS3e.update_yaxes(type = 'log', range = [-10.15, -6.9], exponentformat = 'e', dtick = 1, domain=[0.48, 0.67], row = 2)
inframe_mut_figS3e.update_yaxes(type = 'log', range = [-10.1, -6.9], exponentformat = 'e', dtick = 1, domain=[0.24, 0.43], row = 3)
inframe_mut_figS3e.update_yaxes(type = 'log', range = [-10.1, -6.9], exponentformat = 'e', dtick = 1, domain=[0, 0.19], row = 4)

inframe_mut_figS3e.update_layout(width = 750, height = 500, margin = dict(l = 65, r = 25, b = 35, t = 20), legend=dict(y = -0.03, x = 0.35, orientation='h'))

inframe_mut_figS3e.show()

In [None]:
inframe_mut_figS3e.write_image('./plots/revision_STR_absolutefreq_figS3e.png', format='png', scale = 10, engine = 'orca')

## Direct repeat duplications <a name="mutation_internal_indel_DR_duplications"></a>

#### Find duplications and count them positionally <a name="mutation_internal_indel_DR_duplications_count"></a>

[Return to Table of Contents](#TOC)
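
The cells below locate each inserted sequence within the repeat and its flanks using `re.finditer(..., overlapped = True)`. The `overlapped` keyword is a feature of the third-party `regex` module, which is presumably imported under the name `re` earlier in this notebook (an assumption, since the import is not shown in this section). The sketch below shows an equivalent overlapping search using only the standard library, via a zero-width lookahead; `find_overlapping` is a hypothetical helper.

```python
import re

def find_overlapping(pattern: str, seq: str) -> list:
    """Start positions of all (possibly overlapping) literal matches of `pattern`."""
    # A lookahead consumes no characters, so the scan advances one base at a time
    return [m.start() for m in re.finditer('(?=' + re.escape(pattern) + ')', seq)]

seq = 'GGACACACATT'                      # toy sequence, invented
multi = find_overlapping('ACAC', seq)    # pattern occurs twice, overlapping
unique = find_overlapping('CATT', seq)   # pattern occurs exactly once
print(multi, unique)
```

Only insertions with exactly one match are carried forward, since a unique match is needed to assign the duplication to a single template position.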

In [None]:
# Retrieve info for insertions that overlap with direct repeats
DRs_ins = dict()
for chrom in range(chr_range,23):
    long_ins_in_DR = variants_ins_long[chrom].loc[variants_ins_long[chrom]['POS'].isin(pos_expand_all['DR'].loc[pos_expand_all['DR']['chrom'] == chrom]['pos'])].copy()
    DRs_current = all_DRs.loc[all_DRs['chrom'] == chrom][['start', 'end', 'chrom', 'length', 'Sequence', 'L_end', 'R_start', 'stem_len', 'spacer', 'seq_L', 'seq_R', '#MM', 'MM_pos', 'MM_pos_L', 'MM_pos_R', 'spacer_pos', "spacer_5'_pos", 'spacer_middle_pos', "spacer_3'_pos"]]
    DRs_ins[chrom] = DRs_current.iloc[np.searchsorted(DRs_current['end'], long_ins_in_DR['POS'])].copy()
    DRs_ins[chrom][['POS', 'REF', 'ALT', 'AS_VQSLOD']] = long_ins_in_DR[['POS', 'REF', 'ALT', 'AS_VQSLOD']].values
DRs_ins = pd.concat(DRs_ins)

In [None]:
# Get sequence of direct repeat +/- 10nt, in order to locate template position of inserted sequences
DRs_ins['seq+'] = [reference_genome[chrom][start-10: end+10] for chrom, start, end in zip(DRs_ins['chrom'], DRs_ins['start'], DRs_ins['end'])]
DRs_ins['insert_pos'] = [[match.start(0) for match in re.finditer(insert, seq, overlapped = True)] for insert, seq in zip(DRs_ins['ALT'], DRs_ins['seq+'])]
DRs_ins['n_insert_found'] = [len(location) for location in DRs_ins['insert_pos']]
# Select DRs with one sequence match from insertion to template
DRs_ins_match = DRs_ins.loc[DRs_ins['n_insert_found'] == 1].copy().reset_index()
DRs_ins_match['insert_pos'] = [match[0] for match in DRs_ins_match['insert_pos']]
DRs_ins_match['alt_len'] = DRs_ins_match['ALT'].str.len()
# Location of insertion within template
DRs_ins_match['L_end'] = DRs_ins_match['L_end'].astype(int)
DRs_ins_match['R_start'] = DRs_ins_match['R_start'].astype(int)

In [None]:
# Function to print color-coded text diagrams showing position of templated insertions and position of repeats and mismatches
def highlight_dr_ins(current_dr):
    current_str_series = pd.DataFrame(enumerate(current_dr['seq+']))[1]

    # label insertion template
    current_str_series[current_dr['insert_pos']+1] = bg(255, 150, 50) + current_str_series[current_dr['insert_pos']+1]
    if current_dr['insert_pos'] + current_dr['alt_len'] in current_str_series.index:
        current_str_series[current_dr['insert_pos'] + current_dr['alt_len']] =  bg.rs + current_str_series[current_dr['insert_pos'] + current_dr['alt_len']]

    # label MM positions, left and right repeats
    current_str_series.index += -10 + current_dr['start']
    for pos in current_dr['MM_pos_L']:
        current_str_series[pos] = ef.inverse + current_str_series[pos] + ef.rs
    for pos in current_dr['MM_pos_R']:
        current_str_series[pos] = ef.inverse + current_str_series[pos] + ef.rs
    current_str_series[current_dr['start']] = fg.blue + current_str_series[current_dr['start']]
    if current_dr['L_end'] != current_dr['R_start']:
        current_str_series[current_dr['L_end']] = fg.rs + current_str_series[current_dr['L_end']]
    current_str_series[current_dr['R_start']] = fg.li_blue + current_str_series[current_dr['R_start']]
    current_str_series[current_dr['end']] = fg.rs + current_str_series[current_dr['end']]
    
    return current_str_series.str.cat()

In [None]:
# Show first 50 text diagrams, used in Fig. S4c
for dr in DRs_ins_match.index[:50]:
    print(highlight_dr_ins(DRs_ins_match.loc[dr]))

In [None]:
# Function to count where insertions fall within the repeat structure
def dr_ins_count(current_dr):
    current_str_series = pd.DataFrame(enumerate(current_dr['seq+']))

    current_str_series['ins_start'] = 0
    current_str_series['ins_mid'] = 0
    current_str_series['ins_flank'] = 0
    current_str_series['pos_type'] = np.nan

    # count insertion start/end points
    current_str_series.loc[current_dr['insert_pos']+1:current_dr['insert_pos']+1, 'ins_start'] +=1
    if current_dr['insert_pos'] + current_dr['alt_len']-1 in current_str_series.index:
        current_str_series.loc[current_dr['insert_pos'] + current_dr['alt_len']-1 : current_dr['insert_pos'] + current_dr['alt_len']-1, 'ins_start'] +=1
    # count insertion mid points
    current_str_series.loc[current_dr['insert_pos']+2:current_dr['insert_pos']+current_dr['alt_len']-2, 'ins_mid'] +=1
    # count insertion flank start/end points
    current_str_series.loc[current_dr['insert_pos']:current_dr['insert_pos'], 'ins_flank'] +=1
    if current_dr['insert_pos']+1 + current_dr['alt_len'] in current_str_series.index:
        current_str_series.loc[current_dr['insert_pos'] + current_dr['alt_len'] : current_dr['insert_pos'] + current_dr['alt_len'], 'ins_flank'] +=1
   
    # count MM positions, left and right repeats
    current_str_series.index += -10 + current_dr['start']
    current_str_series.loc[current_dr['start']-2:current_dr['start']-1, 'pos_type'] = 'flank'
    current_str_series.loc[current_dr['end']:current_dr['end']+1, 'pos_type'] = 'flank'
    current_str_series.loc[current_dr['start']:current_dr['start'], 'pos_type'] = 'L_start'
    current_str_series.loc[current_dr['start']+1:current_dr['L_end']-2, 'pos_type'] = 'repeat_L'
    current_str_series.loc[current_dr['L_end']-1:current_dr['L_end']-1, 'pos_type'] = 'L_end'
    current_str_series.loc[current_dr['L_end']:current_dr['L_end'], 'pos_type'] = 'spacer_start'
    current_str_series.loc[current_dr['L_end']+1:current_dr['R_start']-1, 'pos_type'] = 'spacer'
    current_str_series.loc[current_dr['R_start']-1:current_dr['R_start']-1, 'pos_type'] = 'spacer_end'
    current_str_series.loc[current_dr['R_start']:current_dr['R_start'], 'pos_type'] = 'R_start'
    current_str_series.loc[current_dr['R_start']+1:current_dr['end']-2, 'pos_type'] = 'repeat_R'
    current_str_series.loc[current_dr['end']-1:current_dr['end']-1, 'pos_type'] = 'R_end'
    for pos in current_dr['MM_pos_L']:
        current_str_series.loc[pos:pos, 'pos_type'] = 'mismatch_L'
    for pos in current_dr['MM_pos_R']:
        current_str_series.loc[pos:pos, 'pos_type'] = 'mismatch_R'
    current_str_series['total_count'] = 1

    # Select the count columns before summing so the sequence column is not aggregated
    return current_str_series.groupby(['pos_type'])[['ins_start', 'ins_mid', 'ins_flank', 'total_count']].sum()

In [None]:
# Count where insertions fall within the repeat structure
dr_count_sum = pd.DataFrame()
for dr in DRs_ins_match.index:
    dr_count_sum = dr_count_sum.add(dr_ins_count(DRs_ins_match.loc[dr]), fill_value = 0)

# Calculate frequency
dr_count_freq = dr_count_sum.div(dr_count_sum['total_count'], axis=0)
dr_count_freq['ins_start'] = dr_count_freq['ins_start'] / (dr_count_sum.sum()['ins_start'] / dr_count_sum.sum()['total_count'])
dr_count_freq['ins_flank'] = dr_count_freq['ins_flank'] / (dr_count_sum.sum()['ins_flank'] / dr_count_sum.sum()['total_count'])
dr_count_freq['ins_mid'] = dr_count_freq['ins_mid'] / (dr_count_sum.sum()['ins_mid'] / dr_count_sum.sum()['total_count'])

In [None]:
# Count separately for DRs with short spacers
dr_count_sum_spacerb10 = pd.DataFrame()
for dr in DRs_ins_match.loc[DRs_ins_match['spacer'] <10].index:
    dr_count_sum_spacerb10 = dr_count_sum_spacerb10.add(dr_ins_count(DRs_ins_match.loc[dr]), fill_value = 0)

dr_count_freq_spacerb10 = dr_count_sum_spacerb10.div(dr_count_sum_spacerb10['total_count'], axis=0)
dr_count_freq_spacerb10['ins_start'] = dr_count_freq_spacerb10['ins_start'] / (dr_count_sum_spacerb10.sum()['ins_start'] / dr_count_sum_spacerb10.sum()['total_count'])
dr_count_freq_spacerb10['ins_flank'] = dr_count_freq_spacerb10['ins_flank'] / (dr_count_sum_spacerb10.sum()['ins_flank'] / dr_count_sum_spacerb10.sum()['total_count'])
dr_count_freq_spacerb10['ins_mid'] = dr_count_freq_spacerb10['ins_mid'] / (dr_count_sum_spacerb10.sum()['ins_mid'] / dr_count_sum_spacerb10.sum()['total_count'])

dr_count_freq_spacerb10 = dr_count_freq_spacerb10.reindex(['L_start', 'mismatch_L','repeat_L', 'L_end', 'spacer_start', 'spacer', 'spacer_end', 'R_start', 'mismatch_R','repeat_R', 'R_end', 'flank'])
dr_count_freq_spacerb10.index = ['Left start', 'MM left', 'DR left', 'Left end', 'Spacer start', 'Spacer', 'Spacer end', 'Right start', 'MM right', 'DR right', 'Right end', 'flank']

####  Plot for Fig. S4d <a name="mutation_internal_indel_DR_duplications_plot_figS4D"></a>

[Return to Table of Contents](#TOC)
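
The `obs. / exp.` axis in this plot is the per-position-class rate of duplication endpoints divided by the rate pooled over all position classes. A minimal sketch with invented counts, written in plain Python rather than pandas:

```python
# Invented counts per position class: (endpoints observed, positions available)
counts = {
    'L_start': (6, 20),
    'spacer':  (2, 40),
    'flank':   (2, 40),
}

total_events = sum(events for events, _ in counts.values())
total_positions = sum(positions for _, positions in counts.values())
expected_rate = total_events / total_positions   # pooled rate across all classes

# Observed rate in each class relative to the pooled rate
obs_exp = {name: (events / positions) / expected_rate
           for name, (events, positions) in counts.items()}
print(obs_exp)
```

A value of 1 means a position class receives endpoints at the pooled rate, which is why the bars in the figure are drawn from a base of 1 rather than 0.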

In [None]:
dr_duplication_figS4d = go.Figure()
dr_duplication_figS4d.add_trace(go.Bar(x = dr_count_freq_spacerb10.index, y = dr_count_freq_spacerb10['ins_start']-1, base = 1, name = 'duplication start/end'))
#dr_duplication_figS4d.add_trace(go.Bar(x = dr_count_freq_spacerb10.index, y = dr_count_freq_spacerb10['ins_mid']-1, base = 1, name = 'duplication middle'))
dr_duplication_figS4d.add_trace(go.Bar(x = dr_count_freq_spacerb10.index, y = dr_count_freq_spacerb10['ins_flank']-1, base = 1, name = 'duplication flank'))
dr_duplication_figS4d.update_yaxes(zeroline = False, type = 'log', dtick = np.log10(2), range = [-0.725,0.95], tickmode = 'array', tickvals = [8,4,2,1,0.5,0.25], ticktext = [8,4,2,1,'1/2', '1/4'], title = dict(text = 'obs. / exp.', standoff = 0, font = dict(size = 14)))
dr_duplication_figS4d.update_layout(width = 450, height = 250, margin = dict(l = 65, r = 25, b = 60, t = 20), legend=dict(y = 1.175, x = 0.05, orientation='h'))
dr_duplication_figS4d.show()

In [None]:
dr_duplication_figS4d.write_image('./plots/revision_DR_ins_position_bias_fig_S4d.png', format='png', scale = 10, engine = 'orca')

#### Errors within DR duplications <a name="mutation_internal_indel_DR_duplications_errors"></a>

[Return to Table of Contents](#TOC)
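
The fuzzy patterns `{e<=1}` and `{e<=2}` used below come from the third-party `regex` module and allow errors of any type (substitutions, insertions or deletions), which is why the later cells keep only matches whose position-wise comparison yields exactly one or two mismatches. The sketch below is a substitution-only alternative in plain Python, assuming a simple sliding-window Hamming comparison; `hamming_hits` is a hypothetical helper and not the method used in this notebook.

```python
def hamming_hits(insert: str, seq: str, max_sub: int) -> list:
    """Return (start, mismatch_positions) for every window of `seq` that
    matches `insert` with at most `max_sub` substitutions."""
    n = len(insert)
    hits = []
    for start in range(len(seq) - n + 1):
        window = seq[start:start + n]
        mismatches = [i for i, (a, b) in enumerate(zip(window, insert)) if a != b]
        if len(mismatches) <= max_sub:
            hits.append((start, mismatches))
    return hits

seq = 'TTGACCTGAAGG'                       # toy template, invented
one_error = hamming_hits('ACGTGA', seq, 1)  # insertion differs from template at one base
print(one_error)
```

The mismatch positions returned here play the same role as `dup_mut_pos` in the cells below: they index the error within the duplicated sequence so that its trinucleotide context can be read from the padded template.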

In [None]:
# Find duplications, allowing for a single error
DRs_ins['insert_pos_1error'] = [[match.start(0) for match in re.finditer('(' + insert + '){e<=1}', seq, overlapped = True)] for insert, seq in zip(DRs_ins['ALT'], DRs_ins['seq+'])]
DRs_ins['n_insert_found_1error'] = [len(location) for location in DRs_ins['insert_pos_1error']]

# Find duplications, allowing for two errors
DRs_ins['insert_pos_2error'] = [[match.start(0) for match in re.finditer('(' + insert + '){e<=2}', seq, overlapped = True)] for insert, seq in zip(DRs_ins['ALT'], DRs_ins['seq+'])]
DRs_ins['n_insert_found_2error'] = [len(location) for location in DRs_ins['insert_pos_2error']]

In [None]:
# Collect duplications with 0, 1 and 2 errors
DR_ins_0error = DRs_ins.loc[(DRs_ins['n_insert_found'] == 1) & (DRs_ins['ALT'].str.len() >= 10)].copy()
DR_ins_1error = DRs_ins.loc[(DRs_ins['n_insert_found'] == 0) & (DRs_ins['n_insert_found_1error'] == 1) & (DRs_ins['ALT'].str.len() >= 10)].copy()
DR_ins_2error = DRs_ins.loc[(DRs_ins['n_insert_found'] == 0) & (DRs_ins['n_insert_found_2error'] == 1)& (DRs_ins['ALT'].str.len() >= 10)].copy()

DR_ins_1error['insert_pos_1error'] = [pos[0] for pos in DR_ins_1error['insert_pos_1error']]
DR_ins_2error['insert_pos_2error'] = [pos[0] for pos in DR_ins_2error['insert_pos_2error']]
DR_ins_0error['insert_pos'] = [pos[0] for pos in DR_ins_0error['insert_pos']]

# Get motif sequence +/-1 nt for counting trinucleotides
DR_ins_0error['dup_seq+'] = [seq[pos-1:pos+length+1] for seq, pos, length in zip(DR_ins_0error['seq+'], DR_ins_0error['insert_pos'], DR_ins_0error['ALT'].str.len())]

In [None]:
# For 1-error duplications, find location of interruptions within each DR
DR_ins_1error['dup_seq'] = [seq[pos:pos+length] for seq, pos, length in zip(DR_ins_1error['seq+'], DR_ins_1error['insert_pos_1error'], DR_ins_1error['ALT'].str.len())]
DR_ins_1error['dup_mut_pos'] = [np.where([a!=b for a,b in zip(seq1, seq2)]) for seq1, seq2 in zip(DR_ins_1error['dup_seq'], DR_ins_1error['ALT'])]
DR_ins_1error['dup_mut_pos'] = [pos[0] for pos in DR_ins_1error['dup_mut_pos']]
DR_ins_1error['dup_seq+'] = [seq[pos-1:pos+length+1] for seq, pos, length in zip(DR_ins_1error['seq+'], DR_ins_1error['insert_pos_1error'], DR_ins_1error['ALT'].str.len())]

# Restrict to SNVs
DR_ins_1error = DR_ins_1error.loc[DR_ins_1error['dup_mut_pos'].map(len) == 1].copy()
DR_ins_1error['dup_mut_pos'] = [pos[0] for pos in DR_ins_1error['dup_mut_pos']]

# Count errors by trinucleotide context
DR_ins_1error['error_tri'] = [seq[pos:pos+3] for seq, pos in zip(DR_ins_1error['dup_seq+'], DR_ins_1error['dup_mut_pos'])]
DR_ins_1error['error_mut'] = [seq[pos] for seq, pos in zip(DR_ins_1error['ALT'], DR_ins_1error['dup_mut_pos'])]
DR_ins_1error_muterrors = DR_ins_1error.groupby(['error_tri', 'error_mut']).count()['ALT'].unstack().reindex(all_triplets).fillna(0).astype(int)

In [None]:
# For 2-error duplications, find location of interruptions within each DR
DR_ins_2error['dup_seq'] = [seq[pos:pos+length] for seq, pos, length in zip(DR_ins_2error['seq+'], DR_ins_2error['insert_pos_2error'], DR_ins_2error['ALT'].str.len())]
DR_ins_2error['dup_mut_pos'] = [np.where([a!=b for a,b in zip(seq1, seq2)]) for seq1, seq2 in zip(DR_ins_2error['dup_seq'], DR_ins_2error['ALT'])]
DR_ins_2error['dup_mut_pos'] = [pos[0] for pos in DR_ins_2error['dup_mut_pos']]
DR_ins_2error['dup_seq+'] = [seq[pos-1:pos+length+1] for seq, pos, length in zip(DR_ins_2error['seq+'], DR_ins_2error['insert_pos_2error'], DR_ins_2error['ALT'].str.len())]

# Restrict to SNVs
DR_ins_2error = DR_ins_2error.loc[DR_ins_2error['dup_mut_pos'].map(len) == 2].copy()

# Position of first and second errors
DR_ins_2error['dup_mut_pos_0'] = [pos[0] for pos in DR_ins_2error['dup_mut_pos']]
DR_ins_2error['dup_mut_pos_1'] = [pos[1] for pos in DR_ins_2error['dup_mut_pos']]

# Count errors by trinucleotide context
DR_ins_2error['error_tri_1'] = [seq[pos:pos+3] for seq, pos in zip(DR_ins_2error['dup_seq+'], DR_ins_2error['dup_mut_pos_0'])]
DR_ins_2error['error_mut_1'] = [seq[pos] for seq, pos in zip(DR_ins_2error['ALT'], DR_ins_2error['dup_mut_pos_0'])]
DR_ins_2error['error_tri_2'] = [seq[pos:pos+3] for seq, pos in zip(DR_ins_2error['dup_seq+'], DR_ins_2error['dup_mut_pos_1'])]
DR_ins_2error['error_mut_2'] = [seq[pos] for seq, pos in zip(DR_ins_2error['ALT'], DR_ins_2error['dup_mut_pos_1'])]

# Distinguish between double nucleotide errors and two SNVs
DR_ins_2error_doublemut = DR_ins_2error.loc[DR_ins_2error['dup_mut_pos_0'] +1 == DR_ins_2error['dup_mut_pos_1']].copy()
DR_ins_2error = DR_ins_2error.loc[DR_ins_2error['dup_mut_pos_0'] +1 != DR_ins_2error['dup_mut_pos_1']].copy()

# Count errors by trinucleotide context
DR_ins_2error_muterrors_1 = DR_ins_2error.groupby(['error_tri_1', 'error_mut_1']).count()['ALT'].unstack().reindex(all_triplets).fillna(0).astype(int)
DR_ins_2error_muterrors_2 = DR_ins_2error.groupby(['error_tri_2', 'error_mut_2']).count()['ALT'].unstack().reindex(all_triplets).fillna(0).astype(int)

In [None]:
# Collect all counts and calculate normalized frequency
DR_ins_muterrors_all = DR_ins_1error_muterrors + DR_ins_2error_muterrors_1 + DR_ins_2error_muterrors_2
DR_ins_muterrors_all = DR_ins_muterrors_all.stack()
DR_ins_muterrors_all.index = [tri+'_'+alt for tri, alt in zip(DR_ins_muterrors_all.index.get_level_values(0), DR_ins_muterrors_all.index.get_level_values(1))]
DR_ins_muterrors_RC = triplet_combine_RC(DR_ins_muterrors_all, mut_input=True)

DR_ins_muterrors_tri_0 = pd.Series(flatten([re.findall('...', seq) + re.findall('...', seq[1:]) + re.findall('...', seq[2:]) for seq in DR_ins_0error['dup_seq+']])).value_counts()
DR_ins_muterrors_tri_1 = pd.Series(flatten([re.findall('...', seq) + re.findall('...', seq[1:]) + re.findall('...', seq[2:]) for seq in DR_ins_1error['dup_seq+']])).value_counts()
DR_ins_muterrors_tri_2 = pd.Series(flatten([re.findall('...', seq) + re.findall('...', seq[1:]) + re.findall('...', seq[2:]) for seq in DR_ins_2error['dup_seq+']])).value_counts()

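# Sum triplet counts with fill_value=0 so that a triplet absent from one set is not turned into NaN
DR_ins_muterrors_tri = triplet_combine_RC(DR_ins_muterrors_tri_0.add(DR_ins_muterrors_tri_1, fill_value=0).add(DR_ins_muterrors_tri_2, fill_value=0), mut_output=True)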
DR_ins_muterrors_freq = DR_ins_muterrors_RC / DR_ins_muterrors_tri
DR_ins_muterrors_norm = DR_ins_muterrors_freq / (denovo_freq_RC / denovo_n_genomes)
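The three `re.findall('...', ...)` calls above collect every overlapping trinucleotide by reading the sequence in each of its three frames. The sketch below checks this against a plain sliding window on a short example. The helper `overlapping_kmers` and the example sequences are hypothetical and not part of the analysis. It also shows that `+` on two `value_counts()` results aligns on index labels, so a triplet missing from one Series produces NaN unless `fill_value=0` is used.

```python
import re
import pandas as pd

def overlapping_kmers(seq, k=3):
    # Hypothetical helper: all overlapping k-mers via a sliding window
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = 'ACGTACG'

# Three reading frames, as in the cell above
three_frames = re.findall('...', seq) + re.findall('...', seq[1:]) + re.findall('...', seq[2:])

# Same multiset of triplets as the sliding window (order differs)
same_triplets = sorted(three_frames) == sorted(overlapping_kmers(seq))

# Index alignment when summing counts from two sequence sets
counts_a = pd.Series(['ACG', 'ACG', 'CGT']).value_counts()
counts_b = pd.Series(['ACG', 'GTA']).value_counts()
plain_sum = counts_a + counts_b                     # NaN where a label is missing on one side
filled_sum = counts_a.add(counts_b, fill_value=0)   # missing labels treated as zero
```

On genome-scale input every one of the 64 triplets is likely present in each set, in which case both sums agree.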

In [None]:
# Mean of error frequencies, weighted by trinucleotide count
np.average(DR_ins_muterrors_freq, weights = DR_ins_muterrors_tri)
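Because each frequency is an error count divided by a trinucleotide count, weighting the mean by the trinucleotide count gives the pooled error rate (total errors over total sites). A small worked example with made-up numbers, assuming no missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical counts, for illustration only
errors = pd.Series({'ACG_T': 4, 'TTA_C': 1, 'GGC_A': 0})
sites = pd.Series({'ACG_T': 200, 'TTA_C': 1000, 'GGC_A': 50})

freq = errors / sites

# Weighted mean of per-context frequencies
weighted_mean = np.average(freq, weights=sites)

# Pooled rate: total errors divided by total sites
pooled_rate = errors.sum() / sites.sum()
```

Both quantities equal 5 / 1250 here. If any frequency is NaN, `np.average` returns NaN, so the equivalence only holds for complete data.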

#### Plot for Fig. S4e <a name="mutation_internal_indel_DR_duplications_plot_figS4E"></a>

[Return to Table of Contents](#TOC)

In [None]:
DR_spectrum_plot_figS4e = go.Figure()
for group in colors.index:
    DR_spectrum_plot_figS4e.add_trace(go.Bar(x = DR_ins_muterrors_norm.loc[colors['ind'][group]].index, y = DR_ins_muterrors_norm.loc[colors['ind'][group]], marker = dict(color = colors['color'][group]), name = group, showlegend = True))
# Tick label font size is set via tickfont (dtick controls tick spacing, not font)
DR_spectrum_plot_figS4e.update_xaxes(tickfont = dict(size = 6))
DR_spectrum_plot_figS4e.update_yaxes(exponentformat = 'e', title = dict(text = 'Relative mutation frequency', standoff = 0, font = dict(size = 14)))
DR_spectrum_plot_figS4e.update_layout(width = 1400, height = 250, margin = dict(l = 75, r = 25, b = 60, t = 20))

DR_spectrum_plot_figS4e.show()

In [None]:
DR_spectrum_plot_figS4e.write_image('./plots/revision_DR_dup_error_spectrum_fig_S4e.png', format='png', scale = 10, engine = 'orca')

#### Frequency of double-nucleotide variants, used in Fig. S4f <a name="mutation_internal_indel_DR_duplications_figS4F"></a>

[Return to Table of Contents](#TOC)

In [None]:
# Reverse complement of a double-nucleotide variant written as 'REF>ALT',
# e.g. 'AG>CT' -> 'CT>AG'; used to combine each variant with its reverse complement
def reverse_complement_dnv(mut):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N'}
    ref, alt = mut.split('>')
    return ''.join(complement[base] for base in ref[::-1]) + '>' + ''.join(complement[base] for base in alt[::-1])

dinuc = [base1+base2 for base1 in ['A', 'T', 'G', 'C'] for base2 in ['A', 'T', 'G', 'C']]

dinuc_F = ['AA', 'AG', 'AC', 'TG', 'TC', 'GG']
dinuc_RC = ['TT', 'CT', 'GT', 'CA', 'GA', 'CC']
dinuc_sym = ['AT', 'TA', 'GC', 'CG']

dinuc_mut_F = [di1 + '>' + di2 for di1 in dinuc_F for di2 in dinuc if (di1[0] != di2[0]) & (di1[1] != di2[1])]
dinuc_mut_RC = [reverse_complement_dnv(mut) for mut in dinuc_mut_F]
dinuc_mut_sym = [di1 + '>' + di2 for di1 in dinuc_sym for di2 in dinuc if (di1[0] != di2[0]) & (di1[1] != di2[1])]
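As a sanity check on the three lists above, the sketch below rebuilds them in a self-contained form and confirms that the forward, reverse-complement, and symmetric sets together cover each of the 16 x 9 = 144 double-nucleotide variants exactly once. The names `all_dnv` and `partition` are introduced here for illustration only.

```python
def reverse_complement_dnv(mut):
    # Reverse complement both sides of a 'REF>ALT' dinucleotide variant
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N'}
    ref, alt = mut.split('>')
    return (''.join(complement[b] for b in ref[::-1]) + '>' +
            ''.join(complement[b] for b in alt[::-1]))

bases = ['A', 'T', 'G', 'C']
dinuc = [b1 + b2 for b1 in bases for b2 in bases]
dinuc_F = ['AA', 'AG', 'AC', 'TG', 'TC', 'GG']
dinuc_sym = ['AT', 'TA', 'GC', 'CG']

def both_positions_differ(di1, di2):
    return di1[0] != di2[0] and di1[1] != di2[1]

dinuc_mut_F = [d1 + '>' + d2 for d1 in dinuc_F for d2 in dinuc if both_positions_differ(d1, d2)]
dinuc_mut_RC = [reverse_complement_dnv(mut) for mut in dinuc_mut_F]
dinuc_mut_sym = [d1 + '>' + d2 for d1 in dinuc_sym for d2 in dinuc if both_positions_differ(d1, d2)]

# Every possible DNV in which both positions change
all_dnv = [d1 + '>' + d2 for d1 in dinuc for d2 in dinuc if both_positions_differ(d1, d2)]
partition = dinuc_mut_F + dinuc_mut_RC + dinuc_mut_sym

example_rc = reverse_complement_dnv('AG>CT')
```

Note that a variant with a symmetric reference dinucleotide is not necessarily its own reverse complement (for example `AT>CC` maps to `AT>GG`), so the symmetric set keeps such pairs as separate entries.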

In [None]:
DR_dnv_sum = pd.Series([pos1[1] + pos2[1] + '>' + mut1 + mut2 for pos1, pos2, mut1, mut2 in zip(DR_ins_2error_doublemut['error_tri_1'], DR_ins_2error_doublemut['error_tri_2'], DR_ins_2error_doublemut['error_mut_1'], DR_ins_2error_doublemut['error_mut_2'])]).value_counts()
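# Combine each variant with its reverse complement; fill_value=0 keeps variants observed in only one orientation
DR_dnv_sum_RC = pd.concat([(DR_dnv_sum.reindex(dinuc_mut_F, fill_value=0) + DR_dnv_sum.reindex(dinuc_mut_RC, fill_value=0).set_axis(dinuc_mut_F)), DR_dnv_sum.reindex(dinuc_mut_sym, fill_value=0)])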
DR_dnv_freq = DR_dnv_sum_RC / DR_dnv_sum_RC.sum()

DR_ins_muterrors_di_0 = pd.Series(flatten([re.findall('..', seq) + re.findall('..', seq[1:]) for seq in DR_ins_0error['dup_seq+']])).value_counts()
DR_ins_muterrors_di_1 = pd.Series(flatten([re.findall('..', seq) + re.findall('..', seq[1:]) for seq in DR_ins_1error['dup_seq+']])).value_counts()
DR_ins_muterrors_di_2 = pd.Series(flatten([re.findall('..', seq) + re.findall('..', seq[1:]) for seq in DR_ins_2error['dup_seq+']])).value_counts()

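# Sum dinucleotide counts with fill_value=0 so that a dinucleotide absent from one set is not turned into NaN
DR_ins_muterrors_di = DR_ins_muterrors_di_0.add(DR_ins_muterrors_di_1, fill_value=0).add(DR_ins_muterrors_di_2, fill_value=0)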

In [None]:
# List of DNVs sorted by frequency, for Fig. S4f
(DR_dnv_sum_RC / (DR_ins_muterrors_di.reindex([di[:2] for di in DR_dnv_sum_RC.index])).values).dropna().sort_values(ascending = False)

In [None]:
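# Total DNV count divided by the total number of inserted bases across the 0-, 1-, and 2-error insertion sets
DR_dnv_sum_RC.sum() / (DR_ins_0error['ALT'].str.len().sum() + DR_ins_1error['ALT'].str.len().sum() + DR_ins_2error['ALT'].str.len().sum())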