### Notebook to create inputs for PRIDICT2.0 where insertion or deletion location is flexible (e.g. stop-codon insertion).

### Description


If the edit location is flexible, several PRIDICT2.0 predictions need to be performed to find the option with the highest predicted efficiency. This notebook creates the input files for all possible options (insertions/deletions).

Requirements:
- Target sequence with the possible deletion/insertion region put into square brackets ([ ]; NOT ()!), including **100 bp** context on both sides of the brackets
- PRIDICT2.0 conda environment (includes necessary packages)
- For insertions:
    - Define "insertion" as edit_type
    - Define insert bases (IUPAC base code; includes ATGC, N, R etc.)
    - Define insertion frequency (`step`; default = 1; choose e.g. 3 for in-frame insertions)
    - For in-frame insertions: Adjust the target sequence context so that the ORF is in frame at the opening bracket (e.g. add 2 bp at the beginning to get 102 bp of left context, which is divisible by 3)
- For deletions:
    - Define "deletion" as edit_type
    - Define deletion length
    - Define deletion frequency (`step`; default = 1; choose e.g. 3 for in-frame deletions)
    - For in-frame deletions: Adjust the target sequence context so that the ORF is in frame at the opening bracket (e.g. add 2 bp at the beginning to get 102 bp of left context, which is divisible by 3)
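
The context and frame requirements above can be checked before any inputs are generated. The following is a minimal sketch and not part of PRIDICT2.0: `check_bracketed_sequence` is a hypothetical helper name, and the frame check assumes that the reading frame starts at the first base of the provided sequence (as in the 102 bp example above).

```python
def check_bracketed_sequence(sequence, min_context=100, in_frame=False):
    """Return a list of problems found in a bracketed target sequence (empty list = OK).

    Hypothetical helper for illustration only. With in_frame=True it assumes
    that the reading frame starts at the first base of `sequence`.
    """
    # exactly one pair of square brackets, in the right order
    if (sequence.count("[") != 1 or sequence.count("]") != 1
            or sequence.index("[") > sequence.index("]")):
        return ["sequence must contain exactly one '[' followed by one ']'"]

    pre, rest = sequence.split("[")
    target, post = rest.split("]")

    problems = []
    if len(pre) < min_context:
        problems.append(f"left context is {len(pre)} bp (< {min_context} bp)")
    if len(post) < min_context:
        problems.append(f"right context is {len(post)} bp (< {min_context} bp)")
    if in_frame and len(pre) % 3 != 0:
        problems.append(f"left context of {len(pre)} bp is not divisible by 3")
    return problems


# placeholder flanks (poly-A), not a real locus: 102 bp left context is divisible by 3
ok_seq = "A" * 102 + "[CTGTCC]" + "A" * 100
print(check_bracketed_sequence(ok_seq, in_frame=True))
```

An empty list means the sequence satisfies the checks; otherwise each entry names one requirement that is not met.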

How to use:
- Use this notebook to create an input batch file for PRIDICT2.0 prediction with flexible mutations (insertions or deletions)

- Input: Target sequence with square brackets ([]) defining the region where insertion or deletion options should be created
- Single function: Create flexible mutation inputs for 1 condition
- Batch function: Get the inputs for all conditions in an input .csv file
- Finally, run PRIDICT2.0 (batch mode) with the created input sequences to get efficiency predictions

Optional:
- Summarize the predictions of all flexible mutation options into a single file per input condition, sorted by predicted efficiency so that the best predicted pegRNAs appear first (last section)

### Functions required to run notebook 
--> Just press run, no changes required.

In [97]:
import os
import pandas as pd
import itertools
from IPython.display import display, HTML

# IUPAC nucleotide code mapping
IUPAC_CODES = {
    "A": ["A"], "C": ["C"], "G": ["G"], "T": ["T"], "U": ["U"],
    "R": ["G", "A"], "Y": ["C", "T"], "K": ["G", "T"], "M": ["A", "C"],
    "S": ["G", "C"], "W": ["A", "T"], "B": ["G", "T", "C"], "D": ["G", "A", "T"],
    "H": ["A", "C", "T"], "V": ["G", "C", "A"], "N": ["A", "G", "C", "T"]
}

# Function to generate edits
def generate_edits(sequence_name, sequence, edit_type, step=1, insert_value=None, del_length=None, output_file_name="output.csv", visualise=True):
    # test if [] are in sequence, otherwise raise an error
    if "[" not in sequence or "]" not in sequence:
        raise ValueError("Sequence must contain brackets [] to indicate the target region of where to insert or delete.")
    # check if sequence_name, sequence, edit_type and step are provided and if either insert_value or del_length are provided
    if not sequence_name:
        raise ValueError("Sequence name not provided")
    if not sequence:
        raise ValueError("Sequence not provided")
    if edit_type not in ["insertion", "deletion"]:
        raise ValueError("Edit type must be either 'insertion' or 'deletion'")
    if not step:
        raise ValueError("Step size not provided")
    if step < 1:
        raise ValueError("Step size must be greater than 0")

    pre, target, post = sequence.split("[")[0], sequence.split("[")[1].split("]")[0], sequence.split("]")[1]
    edits = []

    # check if pre or post are < 100 bp long and if so, raise an error
    if len(pre) < 100 or len(post) < 100:
        error_parts = []
        if len(pre) < 100:
            error_parts.append(f"Sequence before '[' bracket is too short ({len(pre)} bp; minimum 100 bp required)")
        if len(post) < 100:
            error_parts.append(f"Sequence after ']' bracket is too short ({len(post)} bp; minimum 100 bp required)")
        raise ValueError(" and ".join(error_parts))
    
    
    if edit_type == "insertion":
        if len(target) < 1:
            raise ValueError("Target region is too short (0 bp; minimum 1 bp required to provide different options)")
        if insert_value is None:
            raise ValueError("Insertion sequence not provided.")
        # if any letter of insert_value is not a valid IUPAC code, raise an error
        invalid_bases = [base for base in insert_value if base.upper() not in IUPAC_CODES]
        if invalid_bases:
            raise ValueError("Invalid DNA base(s): " + ", ".join(invalid_bases))
        possible_insertions = ["".join(p) for p in itertools.product(*[IUPAC_CODES.get(base.upper(), [base]) for base in insert_value])]
        for i in range(0, len(target) + 1, step):
            for insertion in possible_insertions:
                new_seq = pre + target[:i] + "(+" + insertion + ")" + target[i:] + post
                edits.append(new_seq)
    
    elif edit_type == "deletion":
        if len(target) < 2:
            raise ValueError(f"Target region is too short ({len(target)} bp; minimum 2 bp required to provide different deletion options)")
        if del_length is None:
            raise ValueError("Deletion length not provided")
        # validate the type first: pandas imports whole numbers as floats if the column contains empty cells
        if isinstance(del_length, float):
            if del_length.is_integer():
                del_length = int(del_length)
            else:
                raise ValueError("Deletion length must be a whole integer")
        elif not isinstance(del_length, int):
            raise ValueError("Deletion length must be an integer")
        # only compare values once del_length is known to be an integer
        if del_length <= 0:
            raise ValueError("Deletion length must be greater than 0 to be of any use")
        if del_length >= len(target):
            raise ValueError("Deletion length must be shorter than the target region")
        for i in range(0, len(target) - del_length + 1, step):
            new_seq = pre + target[:i] + "(-" + target[i:i+del_length] + ")" + target[i+del_length:] + post
            edits.append(new_seq)

    
    # create dataframe with columns sequence_name and editseq
    df = pd.DataFrame(columns=["sequence_name", "editseq"])
    df["sequence_name"] = [f"{sequence_name}_{i+1}" for i in range(len(edits))]
    df["editseq"] = edits

    # save dataframe to csv (create the output folder first if it does not exist yet)
    os.makedirs("./output", exist_ok=True)
    df.to_csv("./output/"+output_file_name, index=False)
    print(f"Saved {len(edits)} PRIDICT2 input sequences to ./output/{output_file_name}")

    if visualise:
        # Display the first 50 edits
        html_table = df.head(50).to_html()
        display(HTML("<div style='overflow-y: scroll; height: 300px;'>" + html_table + "</div>"))
    
    return df

def handle_duplicate_sequences(df):
    """
    Detect and handle duplicate sequence names in repetitive DNA sequences (identical edits), printing warnings and removing duplicates.
    Returns deduplicated dataframe.
    """
    # Find duplicate sequence names
    duplicates = df[df.duplicated(['sequence_name'], keep=False)]
    
    if not duplicates.empty:
        # Group duplicates by sequence_name
        for name, group in duplicates.groupby('sequence_name'):
            print("\nIdentical sequence_name generated (resulting in the same edit):")
            print(f"Sequence name: {name}")
            print("PRIDICT inputs:")
            for _, row in group.iterrows():
                print(f"- {row['editseq']}")
            print("Removed duplicate sequences\n")
        
        # Keep first occurrence of each sequence_name
        df = df.drop_duplicates(subset=['sequence_name'], keep='first')
    
    return df

def flexible_mutation_input_generator(inputpath, inputfilename, outputpath, summaryoutputfilename):
    input_df = pd.read_csv(inputpath+inputfilename)
    all_dfs = []
    error_inputs = []
    for _, row in input_df.iterrows():
        sequence_name = row["Name"]
        sequence = row["sequence_with_brackets"]
        edit_type = row["edit_type"]
        insert = row["insert"]
        del_length = row["deletion_length"]
        output_file_name = f"individual_{sequence_name}_flexible.csv"
        step = row["step"]

        print(sequence_name)
        try:
            df = generate_edits(sequence_name, sequence, edit_type, step, insert_value=insert, del_length=del_length, output_file_name=output_file_name, visualise=False)
            all_dfs.append(df)
        except ValueError as e:
            print(f"Error generating edits for {sequence_name}: {e}")
            error_inputs.append((sequence_name, e))
            continue
        
        print()
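    # pd.concat raises an unspecific error on an empty list, so stop with a clear message instead
    if not all_dfs:
        raise ValueError("No edits could be generated for any row of the input file; see the errors printed above.")
    final_df = pd.concat(all_dfs, ignore_index=True)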

    # Remove the very rare duplicates that arise when edits in repetitive sequence are identical (e.g. CCT(CTC/tTt)TGG == C(CTC/tTt)TCTGG);
    # duplicates are detected by sequence_name and the first occurrence is kept
    final_df = handle_duplicate_sequences(final_df)

    final_df.to_csv(outputpath+summaryoutputfilename, index=False)

    print(f"**********\nSaved {len(final_df)} PRIDICT2 input sequences to summary file {outputpath+summaryoutputfilename}.")
    if error_inputs:
        print("\nErrors occurred for the following sequences:")
        for seq_name, error in error_inputs:
            print(f"{seq_name}: {error}")
    return input_df, final_df

### Single mode:
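
The number of inputs created per condition follows from the loops in `generate_edits`: insertion positions run over `range(0, len(target) + 1, step)` and every IUPAC code in the insert multiplies the count by the number of bases it stands for, while deletions start at `range(0, len(target) - del_length + 1, step)`. A small sketch for estimating the file size in advance (the helper name `expected_input_count` and the DNA-only lookup table are illustrative assumptions, not part of the notebook's functions):

```python
import math

# number of concrete bases each IUPAC DNA code stands for
IUPAC_OPTIONS = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "K": 2, "M": 2, "S": 2, "W": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def expected_input_count(target, edit_type, step=1, insert_value=None, del_length=None):
    """Estimate how many PRIDICT2.0 inputs one condition produces (illustrative helper)."""
    if edit_type == "insertion":
        n_positions = len(range(0, len(target) + 1, step))
        n_variants = math.prod(IUPAC_OPTIONS[base.upper()] for base in insert_value)
        return n_positions * n_variants
    if edit_type == "deletion":
        return len(range(0, len(target) - del_length + 1, step))
    raise ValueError("edit_type must be 'insertion' or 'deletion'")


target = "CTGTCCTCTCTGCCCAGG"  # the 18 bp bracketed region used in the examples of this notebook
print(expected_input_count(target, "insertion", step=1, insert_value="N"))    # 76
print(expected_input_count(target, "insertion", step=3, insert_value="TAG"))  # 7
print(expected_input_count(target, "deletion", step=1, del_length=2))         # 17
```

Ambiguous inserts grow quickly: an insert of `NNN` at 19 positions already yields 19 × 64 = 1216 inputs, which matters because the PRIDICT2.0 batch run is the slow step.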

In [101]:
# Example 1 insertion:
# insert "N" (any base) every 1 bp within the [] brackets
sequence_name = "test_insertion_N"
sequence = "TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGA[CTGTCCTCTCTGCCCAGG]GTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG"
edit_type = "insertion"
insert = "N"  # Can use IUPAC codes or actual bases (e.g. "N" for any base, "R" for A or G, etc.)
output_file_name = "test_insertion_N_flexible.csv"
step = 1  # Define step size (add the insert every X position within the [] brackets)

df = generate_edits(sequence_name, sequence, edit_type, step, insert_value=insert, output_file_name=output_file_name)

Saved 76 PRIDICT2 input sequences to ./output/test_insertion_N_flexible.csv


Unnamed: 0,sequence_name,editseq
0,test_insertion_N_1,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGA(+A)CTGTCCTCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
1,test_insertion_N_2,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGA(+G)CTGTCCTCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
2,test_insertion_N_3,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGA(+C)CTGTCCTCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
3,test_insertion_N_4,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGA(+T)CTGTCCTCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
4,test_insertion_N_5,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGAC(+A)TGTCCTCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
5,test_insertion_N_6,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGAC(+G)TGTCCTCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
6,test_insertion_N_7,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGAC(+C)TGTCCTCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
7,test_insertion_N_8,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGAC(+T)TGTCCTCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
8,test_insertion_N_9,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGACT(+A)GTCCTCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
9,test_insertion_N_10,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGACT(+G)GTCCTCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG


In [102]:
# Example 2 insertion:
# set step to 3 and put bracket "[" in-frame to only get in-frame stop codons
sequence_name = "test_insertion_stop"
sequence = "TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGA[CTGTCCTCTCTGCCCAGG]GTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG"
edit_type = "insertion"
insert = "TAG"  # Can use IUPAC codes
output_file_name = "test_insertion_stop_flexible.csv"
step = 3  # Define step size (add the insert every X position within the [] brackets)

df = generate_edits(sequence_name, sequence, edit_type, step, insert_value=insert, output_file_name=output_file_name)

Saved 7 PRIDICT2 input sequences to ./output/test_insertion_stop_flexible.csv


Unnamed: 0,sequence_name,editseq
0,test_insertion_stop_1,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGA(+TAG)CTGTCCTCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
1,test_insertion_stop_2,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGACTG(+TAG)TCCTCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
2,test_insertion_stop_3,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGACTGTCC(+TAG)TCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
3,test_insertion_stop_4,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGACTGTCCTCT(+TAG)CTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
4,test_insertion_stop_5,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGACTGTCCTCTCTG(+TAG)CCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
5,test_insertion_stop_6,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGACTGTCCTCTCTGCCC(+TAG)AGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
6,test_insertion_stop_7,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGACTGTCCTCTCTGCCCAGG(+TAG)GTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG


In [103]:
# Example deletion:
# delete 2 bases every 1 bp within the [] brackets
sequence_name = "test_deletion_1"
sequence = "TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGA[CTGTCCTCTCTGCCCAGG]GTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG"
edit_type = "deletion"
del_length = 2  # Define deletion length (number of bases to delete)
output_file_name = "test_deletion_flexible.csv"
step = 1  # Define step size (start a deletion every X positions within the [] brackets)

df = generate_edits(sequence_name, sequence, edit_type, step, del_length=del_length, output_file_name=output_file_name)

Saved 17 PRIDICT2 input sequences to ./output/test_deletion_flexible.csv


Unnamed: 0,sequence_name,editseq
0,test_deletion_1_1,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGA(-CT)GTCCTCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
1,test_deletion_1_2,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGAC(-TG)TCCTCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
2,test_deletion_1_3,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGACT(-GT)CCTCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
3,test_deletion_1_4,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGACTG(-TC)CTCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
4,test_deletion_1_5,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGACTGT(-CC)TCTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
5,test_deletion_1_6,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGACTGTC(-CT)CTCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
6,test_deletion_1_7,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGACTGTCC(-TC)TCTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
7,test_deletion_1_8,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGACTGTCCT(-CT)CTGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
8,test_deletion_1_9,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGACTGTCCTC(-TC)TGCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
9,test_deletion_1_10,TGCCTGGAGGTGTCTGGGTCCCTCCCCCACCCGACTACTTCACTCTCTGTCCTCTCTGCCCAGGAGCCCAGGATGTGCGAGTTCAAGTGGCTACGGCCGACTGTCCTCT(-CT)GCCCAGGGTGCGAGGCCAGCTCGGGGGCACCGTGGAGCTGCCGTGCCACCTGCTGCCACCTGTTCCTGGACTGTACATCTCCCTGGTGACCTGGCAGCGCCCAGATG
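
For in-frame deletions, the deletion length and the step are both set to a multiple of 3. The sketch below mirrors the deletion loop of `generate_edits` in a standalone function so the resulting positions can be inspected without writing files; `in_frame_deletions` is a hypothetical name, and the poly-A flanks are placeholders rather than a real locus.

```python
def in_frame_deletions(pre, target, post, del_length=3, step=3):
    """Standalone mirror of the deletion loop in generate_edits (for illustration)."""
    edits = []
    for i in range(0, len(target) - del_length + 1, step):
        edits.append(
            pre + target[:i] + "(-" + target[i:i + del_length] + ")" + target[i + del_length:] + post
        )
    return edits


pre, post = "A" * 102, "A" * 100  # placeholder flanks; 102 bp keeps the bracket in frame
target = "CTGTCCTCTCTGCCCAGG"     # bracketed region from the example above
edits = in_frame_deletions(pre, target, post)

# print only the bracketed region of each edit to keep the output short
for edit in edits:
    print(edit[len(pre):len(edit) - len(post)])
```

With an 18 bp region this gives six codon-sized deletions, one per codon of the region.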


### Batch mode:
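
The batch function reads a .csv file with the columns `Name`, `sequence_with_brackets`, `edit_type`, `insert`, `deletion_length` and `step`. As a rough sketch of how such a file could be assembled with pandas (the row names, the poly-A flanks and the file name `my_flexible_inputs.csv` are placeholders chosen for illustration; in the notebook the file would go into `./input/`):

```python
import os
import tempfile

import pandas as pd

flank = "A" * 100  # placeholder context; replace with the real 100 bp flanks

columns = ["Name", "sequence_with_brackets", "edit_type", "insert", "deletion_length", "step"]
rows = [
    # insertion row: 'insert' is set, 'deletion_length' stays empty
    {"Name": "my_insertion", "sequence_with_brackets": flank + "[CTGTCC]" + flank,
     "edit_type": "insertion", "insert": "TAG", "deletion_length": None, "step": 3},
    # deletion row: 'deletion_length' is set, 'insert' stays empty
    {"Name": "my_deletion", "sequence_with_brackets": flank + "[CTGTCC]" + flank,
     "edit_type": "deletion", "insert": None, "deletion_length": 3, "step": 3},
]
input_table = pd.DataFrame(rows, columns=columns)

out_dir = tempfile.mkdtemp()  # temporary folder for this sketch only
csv_path = os.path.join(out_dir, "my_flexible_inputs.csv")
input_table.to_csv(csv_path, index=False)

# reading the file back shows how pandas types the partly empty columns
reloaded = pd.read_csv(csv_path)
print(reloaded.dtypes)
```

Because `deletion_length` is empty for insertion rows, pandas reads the column as floats (3 becomes 3.0); `generate_edits` converts such whole-number floats back to integers.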

In [104]:
# define input and output paths and filenames
inputpath = './input/'
# check input_flexible_mutations_testfile.csv for details about formatting the input file
# required columns: [Name, sequence_with_brackets, edit_type, insert, deletion_length, step]; the values of insert and deletion_length are optional depending on the edit_type
inputfilename = 'input_flexible_mutations_testfile.csv'
outputpath = './output/'
summaryoutputfilename = 'summarized_flexible_mutations_outputfile.csv'

# run flexible mutation input generator
inputfiledf, outputfiledf = flexible_mutation_input_generator(inputpath, inputfilename, outputpath, summaryoutputfilename)

# continue with running PRIDICT2 (outside of this notebook) with the 'summarized_flexible_mutations_outputfile.csv' as input file
# example command: python pridict2_pegRNA_design.py batch --input-dir ./addons/flexible_mutations/output --input-fname summarized_flexible_mutations_outputfile.csv --output-dir ./predictions

# Optional but NOT RECOMMENDED: run PRIDICT2 from within this notebook. (uncomment !python command below)
# Caveat: takes a LONG time to run; we recommend running it separately via commandline
# !python ../../pridict2_pegRNA_design.py batch --input-dir ./output --input-fname summarized_flexible_mutations_outputfile.csv --output-dir ../../predictions

test_insertion_N
Saved 76 PRIDICT2 input sequences to ./output/individual_test_insertion_N_flexible.csv

test_insertion_stop
Saved 7 PRIDICT2 input sequences to ./output/individual_test_insertion_stop_flexible.csv

test_deletion_1
Saved 17 PRIDICT2 input sequences to ./output/individual_test_deletion_1_flexible.csv

**********
Saved 100 PRIDICT2 input sequences to summary file ./output/summarized_flexible_mutations_outputfile.csv.


### Summarize PRIDICT2.0 predictions of flexible mutations after running PRIDICT2.0

- Only run this after you have run PRIDICT2.0 with the output batch file created above.

- For each input condition (`Name`), the code below combines the predictions of all flexible mutation options into one file, sorts it by the predicted K562 score (highest first) and saves it as a summary prediction file.

- For MMR-deficient context, change "sort_value" from "K562" to "HEK".

- From this summary file, we suggest taking e.g. the top 5 pegRNAs and testing these in your experimental setup.
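
Picking the top candidates from a summary file could look like the sketch below. The score column name is the one used for sorting in this notebook; the toy table, the `pegRNA_label` column and all score values are made up for illustration and do not come from a PRIDICT2.0 run.

```python
import pandas as pd

score_col = "PRIDICT2_0_editing_Score_deep_K562"  # use ..._HEK for MMR-deficient context

# toy stand-in for a '<Name>_all_flexible_predictions.csv' file (made-up values)
summary = pd.DataFrame({
    "pegRNA_label": ["a", "b", "c", "d", "e", "f", "g"],
    score_col: [12.5, 48.0, 33.1, 7.9, 51.2, 20.4, 40.6],
})

# keep the five rows with the highest predicted score, best first
top5 = summary.nlargest(5, score_col).reset_index(drop=True)
print(top5)
```

With a real summary file, `summary` would be loaded with `pd.read_csv` instead of being built by hand.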

In [70]:
### Summarize PRIDICT2 predictions:
pridict2_predictions_folder = '../../predictions/'  # folder with all PRIDICT2 predictions (default "../../predictions/")
summary_prediction_output_folder = './summarized_flexible_mutations_pridict2_predictions/' # folder where summarized prediction files will be saved
sort_value = 'K562' # change to "HEK" for sorting to MMR-deficient cell line prediction

# create the output folder if it does not exist yet
os.makedirs(summary_prediction_output_folder, exist_ok=True)

# list of all PRIDICT2 prediction files (*pegRNA_Pridict_full.csv) in pridict2_predictions_folder:
filelist = [f for f in os.listdir(pridict2_predictions_folder) if f.endswith('pegRNA_Pridict_full.csv')]

for index, row in inputfiledf.iterrows():
    sequence_name = row['Name']
    print(sequence_name)
    # get all files that belong to this input; the trailing "_" avoids matching other names sharing the same prefix (e.g. "test_1" vs "test_10")
    sequence_files = [f for f in filelist if f.startswith(sequence_name + '_')]
    if not sequence_files:
        print(f"No prediction files found for {sequence_name}; skipping.")
        continue
    # read all files and concatenate them
    all_files = []
    for file in sequence_files:
        all_files.append(pd.read_csv(pridict2_predictions_folder+file))
    all_files_df = pd.concat(all_files, ignore_index=True)
    # sort all_files_df by column "PRIDICT2_0_editing_Score_deep_..." (from highest to lowest)
    all_files_df = all_files_df.sort_values(by='PRIDICT2_0_editing_Score_deep_'+sort_value, ascending=False)
    # save concatenated file
    all_files_df.to_csv(summary_prediction_output_folder+sequence_name+'_all_flexible_predictions.csv')

test_insertion_N
test_insertion_stop
test_deletion_1
