## Silent bystander edits for predicting pegRNA efficiency with PRIDICT2.0

Silent bystander edits can increase (e)pegRNA editing efficiency. This notebook creates PRIDICT2.0 inputs that contain silent bystander edits, up to 5 bp up- and downstream of the edit.

Requirements:
- PRIDICT input sequence with **150 bp** context on both sides of the edit (e.g. 150bp - (A/G) - 150bp; longer context needed than the 100bp for standard PRIDICT2.0)
- PRIDICT2.0 conda environment (includes necessary packages)
- Replacements of up to 5 bp are allowed (longer replacements produce too many silent-bystander variants to reasonably obtain PRIDICT2.0 predictions for all of them)
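
As a quick pre-check of the context requirement above, the flank lengths of an input can be measured before running the notebook. This is a minimal sketch: the helper names `flank_lengths` and `has_required_context` are hypothetical and not part of PRIDICT2.0, and the sequence is a toy example.

```python
def flank_lengths(pridict_input):
    """Return the number of bases before '(' and after ')'."""
    left, rest = pridict_input.split("(", 1)
    right = rest.split(")", 1)[1]
    return len(left), len(right)


def has_required_context(pridict_input, required=150):
    """True if both flanks contain at least `required` bases."""
    left, right = flank_lengths(pridict_input)
    return left >= required and right >= required


# Toy sequence (not a real locus): 150 bp - (A/G) - 150 bp
example = "A" * 150 + "(A/G)" + "C" * 150
print(flank_lengths(example))         # (150, 150)
print(has_required_context(example))  # True
```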

How to use:
- Use this notebook to create an input batch file for PRIDICT2.0 prediction with silent bystanders
- We only provide silent bystander predictions for **1bp replacements** or **multi-bp replacements** (not for insertions/deletions)
- If the sequence is in an exon, the PRIDICT input sequence has to be *in frame* (ORF_start=0) and variable "silent" should be "yes"
- Keep variable "change_edit_bases" at its default "no" if you do not wish to change your defined edit bases, even if a different edit would lead to the same amino acid.
  Example: NNN(GTA/CAT)NNN is a V to H change, and so is NNN(GTA/CA**C**)NNN. With "no", the changes within the brackets are kept as defined and NNN(GTA/CA**C**)NNN is not used.
- If you want to create any bystander (including non-silent ones), set variable "silent" to "no".
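
To see which alternative edits "change_edit_bases" controls, it helps to list the codons that encode the same amino acid as the edited codon. The sketch below builds the standard genetic code from its compact string form; `CODON_TO_AA` and `synonymous_codons` are illustrative names, not notebook functions.

```python
from itertools import product

# Standard genetic code in TCAG order (first base varies slowest)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_codons(codon):
    """Return all other codons that encode the same amino acid."""
    codon = codon.upper()
    aa = CODON_TO_AA[codon]
    return sorted(c for c, a in CODON_TO_AA.items() if a == aa and c != codon)


print(CODON_TO_AA["GTA"], CODON_TO_AA["CAT"])  # V H
print(synonymous_codons("CAT"))                # ['CAC']
```

With "change_edit_bases" set to "yes", such a synonymous codon may replace the edit you defined; with "no", only the surrounding codons are varied.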

- Input: PRIDICT input format, but with 150bp flanking bases on both sides
- Manual function: Create silent bystander PRIDICT inputs for 1 mutation/edit
- Batch function: Get the inputs for silent bystander for all PRIDICT inputs in an input .csv file
- Finally run PRIDICT2.0 (batch mode) with created input sequences to get efficiency predictions
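
The batch function reads a .csv file with the columns `Name`, `pridict_input`, `silent`, `change_edit_bases` and `in_frame`. A minimal sketch of writing such a file with the standard library follows; the row content is a toy placeholder, and the text is written to an in-memory buffer here so the example has no side effects.

```python
import csv
import io

# Column names expected by the batch function
COLUMNS = ["Name", "pridict_input", "silent", "change_edit_bases", "in_frame"]

rows = [
    {
        "Name": "toy_edit_1",  # placeholder name
        "pridict_input": "A" * 150 + "(G/C)" + "T" * 150,  # toy sequence
        "silent": "yes",
        "change_edit_bases": "no",
        "in_frame": "yes",
    },
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text.splitlines()[0])
# Name,pridict_input,silent,change_edit_bases,in_frame
```

To create a real input file, open a file in the input folder with `open(path, "w", newline="")` and pass it to `csv.DictWriter` instead of the buffer.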

Optional:
- Summarize predictions of all bystander-variants into a single file, by selecting best predicted pegRNA for each variant (last section)

### Import necessary packages

In [5]:
from Bio.Seq import Seq
import re
import pandas as pd
from itertools import product
import os
import math

### Functions required to run notebook

In [83]:
def generate_all_sequences(length):
    ### Generate all possible sequences of a given length
    nucleotides = ['A', 'T', 'G', 'C']
    return [''.join(seq) for seq in product(nucleotides, repeat=length)]

def generate_combinations(left_context_close, left_options, edit, right_options, right_context_close):
    ### Generate all possible combinations of left and right context with edit
    return [
        left_context_close + left_opt + edit + right_opt + right_context_close
        for left_opt, right_opt in product(left_options, right_options)
    ]

def convert_differences_to_lowercase(option, original):
    ### Convert differences between option and original to lowercase
    return ''.join(
        opt.lower() if opt != orig else opt 
        for opt, orig in zip(option, original)
    )

def split_sequence(seq):
    ### Split a sequence into three parts: before first lowercase letter, between first and last lowercase letter, and after last lowercase letter
    lowercase_positions = [m.start() for m in re.finditer('[a-z]', seq)]
    if not lowercase_positions:
        return seq, '', '', 0
    
    first_lower_pos = lowercase_positions[0]
    last_lower_pos = lowercase_positions[-1]
    before_first_lower = seq[:first_lower_pos]
    between_first_last_lower = seq[first_lower_pos:last_lower_pos + 1]
    after_last_lower = seq[last_lower_pos + 1:]

    return before_first_lower, between_first_last_lower, after_last_lower, len(lowercase_positions)

def validate_context_length(input, minimum_flanking):
    ### Validate that the context length is at least the minimum flanking length
    left_context = len(input.split('(')[0])
    right_context = len(input.split(')')[1])
    if left_context < minimum_flanking or right_context < minimum_flanking:
        raise ValueError('Context length is less than minimum flanking length! Please check your input sequence.')

def process_contexts(input, bystander_window, close_context_len):
    ### Process the input sequence into left and right context
    left_context = input.split('(')[0]
    right_context = input.split(')')[1]
    left_context_start = left_context[:-bystander_window-close_context_len]
    left_context_close = left_context[-bystander_window-close_context_len:-bystander_window]
    left_bystander_window = left_context[-bystander_window:]
    right_context_end = right_context[bystander_window+close_context_len:]
    right_context_close = right_context[bystander_window:bystander_window+close_context_len]
    right_bystander_window = right_context[:bystander_window]
    return left_context_start, left_context_close, left_bystander_window, right_context_close, right_bystander_window, right_context_end

def isDNA(sequence):
    """ Check whether sequence contains only DNA bases. """
    onlyDNA = True
    diff_set = set(sequence) - set('ACTGatgc')
    if diff_set:
        onlyDNA = False
        print('Non-DNA bases detected. Please use ATGC.')
        print(sequence)
        raise ValueError
    return onlyDNA

def primesequenceparsing(sequence: str) -> object:
    """
    Take a target sequence with the desired edit as input and return its
    editing characteristics. The edit is given within brackets (), with the
    original sequence separated from the edited sequence by a slash:
    (A/G) == A to G mutation.
    Placeholder for deletions and insertions is '-'.

    Parameters
    ----------
    sequence : str
        Target sequence with desired edit in brackets ().
    """
    
    sequence = sequence.replace('\n','')  # remove any spaces or linebreaks in input
    sequence = sequence.replace(' ','')
    sequence = sequence.upper()
    if sequence.count('(') != 1:
        print(sequence)
        print('More or less than one bracket found in sequence! Please check your input sequence.')
        raise ValueError

    five_prime_seq = sequence.split('(')[0]
    three_prime_seq = sequence.split(')')[1]

    sequence_set = set(sequence)
    if '/' in sequence_set:
        original_base = sequence.split('/')[0].split('(')[1]
        edited_base = sequence.split('/')[1].split(')')[0]

        # edit flanking bases should *not* be included in the brackets
        if (original_base[0] == edited_base[0]) or (original_base[-1] == edited_base[-1]):
            print(sequence)
            print('Flanking bases should not be included in brackets! Please check your input sequence.')
            raise ValueError
    elif '+' in sequence_set:  #insertion
        original_base = '-'
        edited_base = sequence.split('+')[1].split(')')[0]
    elif '-' in sequence_set:  #deletion
        original_base = sequence.split('-')[1].split(')')[0]
        edited_base = '-'

    # ignore "-" in final sequences (deletions or insertions)
    if original_base == '-':
        original_seq = five_prime_seq + three_prime_seq
        if edited_base != '-':
            mutation_type = 'Insertion'
            correction_length = len(edited_base)
        else:
            print(sequence)
            raise ValueError
    else:
        original_seq = five_prime_seq + original_base + three_prime_seq
        if edited_base == '-':
            mutation_type = 'Deletion'
            correction_length = len(original_base)
        elif len(original_base) == 1 and len(edited_base) == 1:
            if isDNA(original_base) and isDNA(edited_base):  # check if only AGCT is in bases
                mutation_type = '1bpReplacement'
                correction_length = len(original_base)
            else:
                print(sequence)
                print('Non DNA bases found in sequence! Please check your input sequence.')
                raise ValueError
        elif len(original_base) > 1 or len(edited_base) > 1:
            if isDNA(original_base) and isDNA(edited_base):  # check if only AGCT is in bases
                mutation_type = 'MultibpReplacement'
                if len(original_base) == len(
                        edited_base):  # only calculate correction length if replacement does not contain insertion/deletion
                    correction_length = len(original_base)
                else:
                    print(sequence)
                    print('Only 1bp replacements or replacements of equal length (before edit/after edit) are supported! Please check your input sequence.')
                    raise ValueError
            else:
                print(sequence)
                print('Non DNA bases found in sequence! Please check your input sequence.')
                raise ValueError

    if edited_base == '-':
        edited_seq = five_prime_seq + three_prime_seq
    else:
        edited_seq = five_prime_seq + edited_base.lower() + three_prime_seq

    if isDNA(edited_seq) and isDNA(original_seq):  # check whether sequences only contain AGCT
        pass
    else:
        raise ValueError

    basebefore_temp = five_prime_seq[
                      -1:]  # base before the edit, could be changed with baseafter_temp if Rv strand is targeted (therefore the "temp" attribute)
    baseafter_temp = three_prime_seq[:1]  # base after the edit

    editposition_left = len(five_prime_seq)
    editposition_right = len(three_prime_seq)
    return original_base, edited_base, original_seq, edited_seq, editposition_left, editposition_right, mutation_type, correction_length, basebefore_temp, baseafter_temp

def bystander_creation_for_pridict(pridict_input_original, silent_surrounding_AA_nr, ORF_start, name, minimum_flanking, total_edit_limit,max_edit_length, silent='yes', change_edit_bases='no'):
    if silent == 'no':
        if silent_surrounding_AA_nr != 1:
            print('*** Flanking AA number is set to 1 (maximum) for non-silent bystander, as more will lead to memory issues due to a huge amount of possibilities. ***')
            silent_surrounding_AA_nr = 1



    # Map each amino acid to its corresponding codons
    codon_map = {
        'A': ['GCT', 'GCC', 'GCA', 'GCG'],
        'C': ['TGT', 'TGC'],
        'D': ['GAT', 'GAC'],
        'E': ['GAA', 'GAG'],
        'F': ['TTT', 'TTC'],
        'G': ['GGT', 'GGC', 'GGA', 'GGG'],
        'H': ['CAT', 'CAC'],
        'I': ['ATT', 'ATC', 'ATA'],
        'K': ['AAA', 'AAG'],
        'L': ['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'],
        'M': ['ATG'],
        'N': ['AAT', 'AAC'],
        'P': ['CCT', 'CCC', 'CCA', 'CCG'],
        'Q': ['CAA', 'CAG'],
        'R': ['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'],
        'S': ['TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'],
        'T': ['ACT', 'ACC', 'ACA', 'ACG'],
        'V': ['GTT', 'GTC', 'GTA', 'GTG'],
        'W': ['TGG'],
        'Y': ['TAT', 'TAC'],
        '*': ['TAA', 'TAG', 'TGA']  # Stop codons
    }
    # assert that ORF_start is in [0, 1, 2]
    if ORF_start not in [0, 1, 2]:
        raise ValueError('ORF_start has to be 0, 1 or 2!')
    
    pridict_input_original = pridict_input_original[ORF_start:]
    minimum_flanking = minimum_flanking - ORF_start # subtract ORF_start from minimum_flanking to account for the shift in the ORF and allow that the minimum flanking is still met
    pridict_input_original = pridict_input_original.upper()
    validate_context_length(pridict_input_original, minimum_flanking)

    # confirm that pridict_input_original contains replacement edit:
    mutation_type = primesequenceparsing(pridict_input_original)[6]
    if not mutation_type in ['1bpReplacement', 'MultibpReplacement']:
        raise ValueError('Only 1bp replacements or multi bp replacements of equal length (before edit/after edit) are supported for silent bystander predictions! Please check your input sequence.')

    left_context = pridict_input_original.split('(')[0]  # bases before brackets
    right_context = pridict_input_original.split(')')[1]  # bases after brackets
    left_context_length = len(left_context)
    # number of bases which are in the context, but already part of the AA that is edited
    remaining_bases = left_context_length % 3

    # sequence/base of the defined target before editing:
    edit_before = pridict_input_original.split('(')[1].split('/')[0]
    # sequence/base of the defined target after editing:
    edit_after = pridict_input_original.split('/')[1].split(')')[0]
    edit_after = convert_differences_to_lowercase(edit_after, edit_before)  # converts the actual edited bases to lowercase; not the whole edited sequence (e.g. AGA to TGT will be tGt and not tgt)
    edit_length = len(edit_before)
    if edit_length > max_edit_length:
        raise ValueError(f'Only edits of up to {max_edit_length} bases are supported! Please check your input sequence.')

    # sequence of the edited target including context
    edited_sequence = left_context + edit_after + right_context
    # sequence of the original target including context
    original_sequence = left_context + edit_before + right_context

    # number of bases which make up the AA(s) that are part of the edit (in brackets of the pridict input)
    edit_AA_length_in_bp = math.ceil((remaining_bases + edit_length)/3)*3
    # number of AA that are edited (silent or not)
    edit_AA_length = int(edit_AA_length_in_bp / 3)
    downstream_AA_context_basenr = edit_AA_length_in_bp - edit_length - remaining_bases

    # sequence upstream that is unchanged (also not silently changed)
    untouched_left_context = left_context[:-remaining_bases-silent_surrounding_AA_nr*3]
    # sequence downstream that is unchanged (also not silently changed)
    untouched_right_context = right_context[downstream_AA_context_basenr+silent_surrounding_AA_nr*3:]

    # sequence of bases that can be changed by the silent bystander mutations (including edit bases)
    start_pos = len(left_context)-remaining_bases-silent_surrounding_AA_nr*3
    end_pos = len(left_context)-remaining_bases+edit_AA_length_in_bp+silent_surrounding_AA_nr*3

    # print(end_pos, edit_AA_length_in_bp, downstream_AA_context_basenr, silent_surrounding_AA_nr*3, remaining_bases)

    original_bases = original_sequence[start_pos:end_pos]
    original_AA_seq = str(Seq(original_bases).translate())

    potentially_changed_bases = edited_sequence[start_pos:end_pos]
    potentially_changed_AA = str(Seq(potentially_changed_bases).translate())
    # print(potentially_changed_bases)
    if silent == 'yes':
    # create all possible codon variants which result in the same AA sequence
        bystander_option_list = []
        for aa in potentially_changed_AA:
            amino_acid_codons = codon_map[aa]
            bystander_option_list.append(amino_acid_codons)

        # generate all possible combinations of codons in the bystander_option_list. For each AA, we have a list of possible codons in there
        all_combinations = list(product(*bystander_option_list))
        # print(len(all_combinations))
        result_strings = [''.join(combination) for combination in all_combinations]

        # remove the mutated sequence without any bystander from this list
        result_strings = [result_string for result_string in result_strings if result_string.upper() != potentially_changed_bases.upper()]
        print('Number of silent options INCLUDING changed edit bases', len(result_strings))

        if change_edit_bases == 'no':
            # remove options where the defined edit bases are also (silently) changed: at every position that is lowercase (edited) in potentially_changed_bases, result_string must carry the same base
            result_strings = [result_string for result_string in result_strings if all([result_string[i].upper() == potentially_changed_bases[i].upper() for i in range(len(result_string)) if potentially_changed_bases[i].islower()])]
            print('Number of silent options EXCLUDING changed edit bases', len(result_strings))

        result_AAs = [str(Seq(result_string).translate()) for result_string in result_strings]


    elif silent == 'no':
        result_strings = generate_all_sequences(len(potentially_changed_bases))
        # print('generating sequences done')


        # remove the mutated sequence without any bystander from this list
        result_strings = [result_string for result_string in result_strings if result_string.upper() != potentially_changed_bases.upper()]
        print('Number of bystander options INCLUDING changed edit bases', len(result_strings))

        if change_edit_bases == 'no':
            # remove options where the defined edit bases are also changed: at every position that is lowercase (edited) in potentially_changed_bases, result_string must carry the same base
            result_strings = [result_string for result_string in result_strings if all([result_string[i].upper() == potentially_changed_bases[i].upper() for i in range(len(result_string)) if potentially_changed_bases[i].islower()])]
            print('Number of bystander options EXCLUDING changed edit bases', len(result_strings))

        result_AAs = [str(Seq(result_string).translate()) for result_string in result_strings]

    # for each element of result_string, check whether a base is different from the original sequence. If so, convert it to lowercase
    result_strings = [convert_differences_to_lowercase(result_string, original_bases) for result_string in result_strings]

    full_edited_list = [untouched_left_context + result_string + untouched_right_context for result_string in result_strings]

    pridict_input_lst = []
    totalbasechanges_lst = []
    edit_name_bystander_lst = []
    original_length_lst = []
    final_length_lst = []
    edited_bystander_focus_sequence_lst = []
    edited_bystander_focus_AA_lst = []
    editonly_focus_sequence_lst = []
    editonly_nobystander_focus_AA_lst = []
    for index, sequence in enumerate(full_edited_list):
        unchanged_before, edit, unchanged_after, totalbasechanges = split_sequence(sequence)
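        # explicit end index: a negative slice bound of -0 would return an empty string
        original_base = original_sequence[len(unchanged_before):len(original_sequence) - len(unchanged_after)]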
        final_pridict_seq = f"{unchanged_before}({original_base}/{edit}){unchanged_after}"
        edit_name_bystander = f"{name}_{original_base}_{edit}"
        edited_bystander_focuse_sequence = result_strings[index]
        edited_bystander_focus_AA = result_AAs[index]
        edit_name_bystander_lst.append(edit_name_bystander)
        final_length_lst.append(len(edit))
        pridict_input_lst.append(final_pridict_seq)
        totalbasechanges_lst.append(totalbasechanges)
        original_length_lst.append(edit_length)
        edited_bystander_focus_sequence_lst.append(edited_bystander_focuse_sequence)
        edited_bystander_focus_AA_lst.append(edited_bystander_focus_AA)
        editonly_focus_sequence_lst.append(potentially_changed_bases)
        editonly_nobystander_focus_AA_lst.append(potentially_changed_AA)
  
    print(f'Number of possible bystander mutations: {len(result_strings)}')
    print('Sequence flanking edit position (before edit):',original_bases)
    print('Sequence flanking edit position (AFTER edit, without bystander editing):',potentially_changed_bases)
    print('AA flanking edit position (before edit):',original_AA_seq)
    print('AA flanking edit position (after edit, without bystander editing):',potentially_changed_AA)
    print('AA flanking edit position (after edit, WITH bystander editing:',set(result_AAs))

    finaldf = pd.DataFrame(list(zip(
        edit_name_bystander_lst, pridict_input_lst, original_length_lst, final_length_lst, totalbasechanges_lst, edited_bystander_focus_sequence_lst, edited_bystander_focus_AA_lst, editonly_focus_sequence_lst, editonly_nobystander_focus_AA_lst
    )), columns=['sequence_name', 'editseq', 'original_edit_length', 'final_edit_length_with_bystander', 'total_nr_of_base_changes', 'bystander_focus_sequence', 'bystander_focus_AA', 'editedonly_focus_sequence', 'editedonly_focus_AA'])

    # filter finaldf to exclude rows where final_edit_length_with_bystander > total_edit_limit
    finaldf = finaldf[finaldf.final_edit_length_with_bystander <= total_edit_limit]
    return finaldf

def bystander_input_generator(inputpath, inputfilename, outputpath, outputfilename, silent_surrounding_AA_nr, total_edit_limit, max_edit_length, minimum_flanking):
    df = pd.read_csv(inputpath+inputfilename)
    all_dfs = []
    for _, row in df.iterrows():
        name = row["Name"]
        silent = row["silent"]
        change_edit_bases = row["change_edit_bases"]
        print(name)
        ORF_start = row["in_frame"]
        if ORF_start == 'yes':
            ORF_start = 0
        else:
            raise ValueError('ORF start is not in frame! Please check your input sequence. Input sequence has to be in-frame.')
        pridict_input_original = row.pridict_input
        silentbystanderdf = bystander_creation_for_pridict(pridict_input_original, silent_surrounding_AA_nr, ORF_start, name, minimum_flanking, total_edit_limit,max_edit_length, silent=silent, change_edit_bases=change_edit_bases)
        print(f'{len(silentbystanderdf)} silent bystander sequences created for {name}')
        print()
        all_dfs.append(silentbystanderdf)
    final_df = pd.concat(all_dfs, ignore_index=True)
    final_df.to_csv(outputpath+outputfilename)
    return df, final_df
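
The step that turns a bystander variant back into PRIDICT input format can be followed on a short window. The sketch below condenses the logic of `convert_differences_to_lowercase` and `split_sequence` into one hypothetical helper, `to_pridict_format`; it is an illustration, not a replacement for the functions above.

```python
def mark_changes(variant, original):
    """Lowercase every base of `variant` that differs from `original`."""
    return "".join(v.lower() if v != o else v for v, o in zip(variant, original))


def to_pridict_format(original, variant):
    """Bracket the span from the first to the last changed base.

    Returns the PRIDICT-style string and the number of changed bases.
    """
    original = original.upper()
    marked = mark_changes(variant.upper(), original)
    changed = [i for i, base in enumerate(marked) if base.islower()]
    if not changed:
        raise ValueError("variant is identical to original")
    first, last = changed[0], changed[-1]
    span_original = original[first:last + 1]
    span_edited = marked[first:last + 1]
    text = f"{marked[:first]}({span_original}/{span_edited}){marked[last + 1:]}"
    return text, len(changed)


# 15-bp window around the CFTR example edit; the variant keeps D-S-H-E-P
original = "GATTCTGATGAGCCT"
variant = "GATTCCCATGAACCT"
print(to_pridict_format(original, variant))
# ('GATTC(TGATGAG/ccATGAa)CCT', 3)
```

Unchanged bases between two changed positions stay uppercase inside the brackets, which is why the bracketed edit can be longer than the number of base changes.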

### Manual mode:

In [88]:
### Required variables (adapt to your needs)
name = 'test1_cftr'
pridict_input_original = 'TGGACAGAAACAAAAAAACAATCTTTTAAACAGACTGGAGAGTTTGGGGAAAAAAGGAAGAATTCTATTCTCAATCCAATCAACTCTATACGAAAATTTTCCATTGTGCAAAAGACTCCCTTACAAATGAATGGCATCGAAGAGGATTCT(G/C)ATGAGCCTTTAGAGAGAAGGCTGTCCTTAGTACCAGATTCTGAGCAGGGAGAGGCGATACTGCCTCGCATCAGCGTGATCAGCACTGGCCCCACGCTTCAGGCACGAAGGAGGCAGTCTGTCCTGAACCTGATGACACACTCAGTTAACC'
silent = 'yes'  # default: yes; put to 'no' if you want to get all possible bystander mutations, including non-silent mutations which change the AA sequence
change_edit_bases='no'  # default: no; put to 'yes' if you want to get all possible silent bystander mutations, including those which also change the edit bases you defined in the input (if applicable)
###

### default variables (do not change unless you want to adapt the script)
silent_surrounding_AA_nr = 2 # number of amino acids up- and downstream of edit AA for which silent bystander mutations will be created; default = 2; maximum 4 AA (up- and downstream) are allowed in this script
ORF_start = 0  # 0 means that the ORF starts at the first base of the input sequence; if the ORF starts at the second or third base, change this to 1 or 2, respectively
total_edit_limit = 40 # limit maximum length of edit (including bystander edits) to 40 bases; max. edit length for PRIDICT2.0 predictions is 40 bases
max_edit_length = 10 # maximum length of the edit bases you want to change
minimum_flanking = 94  # minimum edit-flanking length, after correcting for ORF_start. Only required as sanity check; do not change.
###

# run manual bystander creation:
silentbystanderdf = bystander_creation_for_pridict(pridict_input_original, silent_surrounding_AA_nr, ORF_start, name, minimum_flanking, total_edit_limit,max_edit_length, silent=silent, change_edit_bases=change_edit_bases)


Number of silent options INCLUDING changed edit bases 191
Number of silent options EXCLUDING changed edit bases 191
Number of possible bystander mutations: 191
Sequence flanking edit position (before edit): GATTCTGATGAGCCT
Sequence flanking edit position (AFTER edit, without bystander editing): GATTCTcATGAGCCT
AA flanking edit position (before edit): DSDEP
AA flanking edit position (after edit, without bystander editing): DSHEP
AA flanking edit position (after edit, WITH bystander editing: {'DSHEP'}
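
The count of 191 can be checked by hand: the number of silent variants is the product of the codon counts of each amino acid in the window, minus one for the edited sequence without bystanders. For DSHEP this is 2 × 6 × 2 × 2 × 4 − 1 = 191. Both counts are equal here because the edited base is the first base of the histidine codon, and both histidine codons (CAT, CAC) start with C. A minimal sketch, with `count_silent_variants` as an illustrative helper name:

```python
from itertools import product
from math import prod

# Standard genetic code in TCAG order (first base varies slowest)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_per_aa(aa):
    """Number of codons that encode the amino acid `aa`."""
    return sum(1 for a in CODON_TO_AA.values() if a == aa)


def count_silent_variants(aa_seq):
    """All codon combinations for `aa_seq`, minus the edit-only sequence."""
    return prod(codons_per_aa(aa) for aa in aa_seq) - 1


print(count_silent_variants("DSHEP"))  # 191
```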


In [82]:
# preview of silentbystanderdf:
silentbystanderdf.head(10)

| | sequence_name | editseq | original_edit_length | final_edit_length_with_bystander | total_nr_of_base_changes | bystander_focus_sequence | bystander_focus_AA | editedonly_focus_sequence | editedonly_focus_AA |
|---|---|---|---|---|---|---|---|---|---|
| 0 | test1_cftr_TCTGATGAG_aaacAaaAa | TGGACAGAAACAAAAAAACAATCTTTTAAACAGACTGGAGAGTTTG... | 1 | 9 | 7 | aaacAaaAa | KQK | TCTcATGAG | SHE |
| 1 | test1_cftr_TCTGATGAG_aaacAaaAt | TGGACAGAAACAAAAAAACAATCTTTTAAACAGACTGGAGAGTTTG... | 1 | 9 | 7 | aaacAaaAt | KQN | TCTcATGAG | SHE |
| 2 | test1_cftr_TCTGATG_aaacAaa | TGGACAGAAACAAAAAAACAATCTTTTAAACAGACTGGAGAGTTTG... | 1 | 7 | 6 | aaacAaaAG | KQK | TCTcATGAG | SHE |
| 3 | test1_cftr_TCTGATGAG_aaacAaaAc | TGGACAGAAACAAAAAAACAATCTTTTAAACAGACTGGAGAGTTTG... | 1 | 9 | 7 | aaacAaaAc | KQN | TCTcATGAG | SHE |
| 4 | test1_cftr_TCTGATGAG_aaacAaata | TGGACAGAAACAAAAAAACAATCTTTTAAACAGACTGGAGAGTTTG... | 1 | 9 | 8 | aaacAaata | KQI | TCTcATGAG | SHE |
| 5 | test1_cftr_TCTGATGAG_aaacAaatt | TGGACAGAAACAAAAAAACAATCTTTTAAACAGACTGGAGAGTTTG... | 1 | 9 | 8 | aaacAaatt | KQI | TCTcATGAG | SHE |
| 6 | test1_cftr_TCTGATGA_aaacAaat | TGGACAGAAACAAAAAAACAATCTTTTAAACAGACTGGAGAGTTTG... | 1 | 8 | 7 | aaacAaatG | KQM | TCTcATGAG | SHE |
| 7 | test1_cftr_TCTGATGAG_aaacAaatc | TGGACAGAAACAAAAAAACAATCTTTTAAACAGACTGGAGAGTTTG... | 1 | 9 | 8 | aaacAaatc | KQI | TCTcATGAG | SHE |
| 8 | test1_cftr_TCTGATGAG_aaacAaaga | TGGACAGAAACAAAAAAACAATCTTTTAAACAGACTGGAGAGTTTG... | 1 | 9 | 8 | aaacAaaga | KQR | TCTcATGAG | SHE |
| 9 | test1_cftr_TCTGATGAG_aaacAaagt | TGGACAGAAACAAAAAAACAATCTTTTAAACAGACTGGAGAGTTTG... | 1 | 9 | 8 | aaacAaagt | KQS | TCTcATGAG | SHE |
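
The length of the focus sequence follows from the codon arithmetic inside `bystander_creation_for_pridict`. The sketch below repeats that arithmetic in isolation (`focus_window` is an illustrative name), assuming an in-frame input with 150 bases of left context and a 1-bp edit. Two flanking codons give the 15-base window printed by the silent run, and one flanking codon, which the function enforces when "silent" is "no", gives a 9-base window as in the preview.

```python
import math


def focus_window(left_context_len, edit_len, flank_aa):
    """Start and end index of the codon window that may be changed."""
    # bases of the left context that already belong to the edited codon
    remaining = left_context_len % 3
    # length in bp of the codon(s) touched by the edit
    edit_codon_bp = math.ceil((remaining + edit_len) / 3) * 3
    start = left_context_len - remaining - flank_aa * 3
    end = left_context_len - remaining + edit_codon_bp + flank_aa * 3
    return start, end


print(focus_window(150, 1, 2))  # (144, 159)
print(focus_window(150, 1, 1))  # (147, 156)
```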


### Batch mode:

In [86]:
# define input and output paths and filenames
inputpath = './input/'
inputfilename = 'input_testfile.csv' # check input_testfile.csv for details about formatting the input file; required columns: [Name, pridict_input, silent, change_edit_bases, in_frame]
outputpath = './output/'
outputfilename = 'outputfile.csv'
#

### --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# default variables (do not change unless you want to adapt the script):
silent_surrounding_AA_nr = 2 # number of amino acids up- and downstream of edit AA for which silent bystander mutations will be created; default = 2; maximum 4 AA (up- and downstream) are allowed in this script
total_edit_limit = 40 # limit maximum length of edit (including bystander edits) to 40 bases; max. edit length for PRIDICT2.0 predictions is 40 bases
max_edit_length = 10 # maximum length of the edit bases you want to change
minimum_flanking = 94  # minimum edit-flanking length, after correcting for ORF_start. Only required as sanity check; do not change.
### --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# run bystander input generator
inputfiledf, outputfiledf = bystander_input_generator(inputpath, inputfilename, outputpath, outputfilename, silent_surrounding_AA_nr, total_edit_limit, max_edit_length, minimum_flanking)

# continue with running PRIDICT2 (outside of this notebook) with the outputfile.csv as input file
# example command: python pridict2_pegRNA_design.py batch --input-dir ./silentbystander_addon/output --input-fname outputfile.csv --output-dir ./predictions

# Optional but NOT RECOMMENDED: run PRIDICT2 from within this notebook. (uncomment !python command below)
# Caveat: takes a LONG time to run; we recommend running it separately via commandline
# !python ../pridict2_pegRNA_design.py batch --input-dir ./output --input-fname outputfile.csv --output-dir ../predictions

test_mutation_1_cftr
Number of silent options INCLUDING changed edit bases 191
Number of silent options EXCLUDING changed edit bases 191
Number of possible bystander mutations: 191
Sequence flanking edit position (before edit): GATTCTGATGAGCCT
Sequence flanking edit position (AFTER edit, without bystander editing): GATTCTcATGAGCCT
AA flanking edit position (before edit): DSDEP
AA flanking edit position (after edit, without bystander editing): DSHEP
AA flanking edit position (after edit, WITH bystander editing: {'DSHEP'}
191 silent bystander sequences created for test_mutation_1_cftr

test_mutation_2_brca1
Number of silent options INCLUDING changed edit bases 31
Number of silent options EXCLUDING changed edit bases 31
Number of possible bystander mutations: 31
Sequence flanking edit position (before edit): AAGGAAGAAAATCAA
Sequence flanking edit position (AFTER edit, without bystander editing): AAGGAAaAAAATCAA
AA flanking edit position (before edit): KEENQ
AA flanking edit position (afte

### Summarize PRIDICT2.0 predictions of silent bystanders after running PRIDICT2.0
- Only run this after you have run PRIDICT2.0 with the output batch file created above.

- The code below collects all predictions for the different silent bystanders in one file, sorts them by K562 score, and saves the result as a summary prediction file.

- For MMR-deficient context, change "sort_value" from "K562" to "HEK".

- From this summary file, we suggest taking, for example, the top 5 pegRNAs and testing them in your experimental setup.

In [None]:
### Summarize PRIDICT2 predictions:
pridict2_predictions_folder = '../predictions/'  # folder with all PRIDICT2 predictions (default "../predictions/")
summary_prediction_output_folder = './summarized_silent_pridict2_predictions/' # folder where summarized prediction files will be saved
sort_value = 'K562' # change to "HEK" for sorting to MMR-deficient cell line prediction

# filelist of all .csv files in pridict2_predictions_folder:
filelist = [f for f in os.listdir(pridict2_predictions_folder) if f.endswith('pegRNA_Pridict_full.csv')]

# create the output folder if it does not exist yet
os.makedirs(summary_prediction_output_folder, exist_ok=True)

for index, row in inputfiledf.iterrows():
    sequence_name = row['Name']
    # get all files in filelist that start with the sequence_name
    sequence_files = [f for f in filelist if f.startswith(sequence_name)]
    if not sequence_files:
        # pd.concat raises a ValueError on an empty list, so skip names without predictions
        print(f'No prediction files found for {sequence_name}; skipping.')
        continue
    # read all files and concatenate them
    all_files = []
    for file in sequence_files:
        all_files.append(pd.read_csv(os.path.join(pridict2_predictions_folder, file)))
    all_files_df = pd.concat(all_files, ignore_index=True)
    # sort all_files_df by column "PRIDICT2_0_editing_Score_deep_..." (from highest to lowest)
    all_files_df = all_files_df.sort_values(by='PRIDICT2_0_editing_Score_deep_'+sort_value, ascending=False)
    # save concatenated file
    all_files_df.to_csv(os.path.join(summary_prediction_output_folder, sequence_name+'_all_silent_predictions.csv'))
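
If only the best pegRNA per bystander variant is wanted, the sorted predictions can be reduced further. The sketch below shows the idea in plain Python on toy rows. The `variant` key is an assumption (for example, a label derived from each prediction file name), and the scores are made-up placeholders, not PRIDICT2.0 output.

```python
SCORE = "PRIDICT2_0_editing_Score_deep_K562"


def best_per_variant(rows, score_column=SCORE, variant_key="variant"):
    """Keep the highest-scoring row per variant, sorted best first."""
    best = {}
    for row in rows:
        key = row[variant_key]
        if key not in best or row[score_column] > best[key][score_column]:
            best[key] = row
    return sorted(best.values(), key=lambda r: r[score_column], reverse=True)


# Toy rows with made-up scores; "variant" is a hypothetical column
rows = [
    {"variant": "edit_A", SCORE: 41.0},
    {"variant": "edit_A", SCORE: 55.5},
    {"variant": "edit_B", SCORE: 62.1},
    {"variant": "edit_C", SCORE: 12.3},
]

top = best_per_variant(rows)[:2]  # use [:5] for the top 5
print([r["variant"] for r in top])  # ['edit_B', 'edit_A']
```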