# Oligo Design

For Nucleaseq, we want to design sequences with the following basic structure:

| Left primer | Left BC | Left buffer | Target$_n$ | Right buffer | Right fill | Right BC | Right primer |
| - | - | - | - | - | - | - | - |

Here Target$_n$ is drawn from the set of all desired modified target sequences.
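As a minimal sketch of how the segments in the table fit together, the following concatenates the eight parts in left-to-right order. The function name `assemble_oligo`, its parameter names, and the toy sequences are hypothetical and are not part of the `nucleaseq` package; the notebook presumably relies on the imported `NucleaSeqOligo` class for the real construction.

```python
def assemble_oligo(left_primer, left_bc, left_buffer, target,
                   right_buffer, right_fill, right_bc, right_primer):
    """Concatenate the oligo segments in the order given in the table."""
    parts = [left_primer, left_bc, left_buffer, target,
             right_buffer, right_fill, right_bc, right_primer]
    return ''.join(parts)


# Made-up toy segments, for illustration only
oligo = assemble_oligo('AAAA', 'CC', 'G', 'TTTAACGT', 'G', 'T', 'CC', 'AAAA')
print(oligo)
```

Keeping the segments as separate arguments makes it easy to swap in a different Target$_n$ while holding primers, barcodes, and buffers fixed.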

The set of Target$_n$ sequences is specific to each experiment, but some sets are relatively standard. For example, most experiments will include all single- and double-mismatch sequences. The PAM structure, on the other hand, differs for each CRISPR variant and may require custom sequence generation, and other nucleases may have entirely different needs.

To handle this, design.py provides a number of functions for standard modifications, which can be called below, and a space is explicitly reserved for custom sequence generation: "Custom sequence functions". Once the set of target-generation functions is complete, they need to be added to the "Construct Sequences" section below, in the manner shown by the included examples. Go through this section carefully to verify that the set of included sequences is correct.
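To make the "standard modifications" concrete, here is a small, self-contained sketch that enumerates all single- and double-mismatch variants of a sequence. It does not use or reproduce the functions in design.py, whose names and signatures are not shown here; `single_mismatches`, `double_mismatches`, and `toy_target` are hypothetical names chosen for illustration.

```python
import itertools

BASES = 'ACGT'


def single_mismatches(seq):
    """Yield every sequence that differs from seq at exactly one position."""
    for i, ref in enumerate(seq):
        for b in BASES:
            if b != ref:
                yield seq[:i] + b + seq[i + 1:]


def double_mismatches(seq):
    """Yield every sequence that differs from seq at exactly two positions."""
    for i, j in itertools.combinations(range(len(seq)), 2):
        for bi in BASES:
            if bi == seq[i]:
                continue
            for bj in BASES:
                if bj == seq[j]:
                    continue
                yield seq[:i] + bi + seq[i + 1:j] + bj + seq[j + 1:]


toy_target = 'ACGT'  # made-up sequence, not a real target
singles = list(single_mismatches(toy_target))
doubles = list(double_mismatches(toy_target))
print(len(singles), len(doubles))
```

For a sequence of length $L$ this gives $3L$ single mismatches and $9\binom{L}{2}$ double mismatches, which is useful when estimating how much of the library these sets will occupy.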

The "Run parameters" section also needs to be updated according to the experimental requirements. One parameter that must be specified is a list of all canonical cut positions along the target sequence. Each position is written as a Python integer index followed by ".5" to indicate where the cut falls. For instance, 18.5 cuts between Python indices 18 and 19. The lower portion of this notebook needs this information to find appropriate primer sequences.
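The half-integer convention can be illustrated with a short sketch that splits a sequence at such a position. The helper `split_at_cut` and the toy sequence are hypothetical; the rounding with `math.ceil` mirrors the way cut sites are converted to slice indices later in this notebook.

```python
import math


def split_at_cut(seq, cut_pos):
    """Split seq at a half-integer cut position such as 18.5."""
    if cut_pos % 1 != 0.5:
        raise ValueError('cut position must end in .5')
    idx = math.ceil(cut_pos)  # 18.5 -> 19, so index 18 stays on the left
    return seq[:idx], seq[idx:]


toy_seq = 'ACGT' * 6  # made-up 24 bp sequence
left, right = split_at_cut(toy_seq, 18.5)
print(left, right)
```

Here `left` ends with the base at index 18 and `right` starts with the base at index 19, which is exactly the "between 18 and 19" reading of 18.5.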

Finally, the user is expected to adjust the primers as necessary to fit their experimental conditions. See the "Replace Primers" section for details.

## Sequence Motifs

### For this notebook, we want the following sequences and modifications:

* Target D - TnpB (IsDra2)
* single mismatches
* double mismatches
* single insertions
* double insertions
* single deletions 
* double deletions
* scanning mismatch regions (using the complement) for 3 bp to 24 bp length regions
* Perfect target with different buffer regions
* Perfect target with various barcodes
* Random negative control seqs
* Various 5N PAM
* single mm and single ins seqs
* single mm and single del seqs
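Several items in this list are simple string transformations. The sketch below shows one possible way to generate single deletions, single insertions, and complemented "scanning" regions. All function names are hypothetical and independent of design.py, and details such as allowing insertions at both ends of the sequence are assumptions rather than the notebook's confirmed behavior.

```python
BASES = 'ACGT'
COMPLEMENT = str.maketrans('ACGT', 'TGCA')


def single_deletions(seq):
    """Return the set of sequences with exactly one base removed."""
    return {seq[:i] + seq[i + 1:] for i in range(len(seq))}


def single_insertions(seq):
    """Return the set of sequences with exactly one base inserted."""
    return {seq[:i] + b + seq[i:] for i in range(len(seq) + 1) for b in BASES}


def complement_stretches(seq, min_len=3, max_len=24):
    """Replace each window of min_len..max_len bases with its complement."""
    out = []
    for length in range(min_len, min(max_len, len(seq)) + 1):
        for start in range(len(seq) - length + 1):
            end = start + length
            out.append(seq[:start] + seq[start:end].translate(COMPLEMENT) + seq[end:])
    return out


dels = single_deletions('AAC')
ins = single_insertions('AC')
stretches = complement_stretches('ACGT')
print(sorted(dels), len(ins), stretches)
```

Sets are used for the indels because different edits can produce the same sequence, for example deleting either base of a homopolymer run.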

In [2]:
import time
notebook_start_time = time.time()

%matplotlib inline
%load_ext autoreload
%autoreload 2

import sys
import yaml
import itertools
import random
import math
import logging
import scipy.misc
import editdistance
import numpy as np
from copy import deepcopy
from collections import defaultdict, Counter
from Bio.Seq import Seq
# Show Jupyter where to find freebarcodes. This must run before the freebarcodes imports.
sys.path.append('/home/joneslab/nucleaseq/freebarcodes')
from freebarcodes.seqtools import simple_hamming_distance
import freebarcodes.seqtools as fbseqtools

from nucleaseq import design, seqtools
from nucleaseq.NucleaSeqOligo import NucleaSeqOligo
from nucleaseq.equalmarginalseqs import generate_clean_random_eqmarg_seqs

master_seed = 37
random.seed(master_seed)
rand_seeds = [random.randint(0, 100) for i in range(10)]

# Run parameters

In [3]:
# Library features
total_desired_seqs = int(input('What library size do you desire? Default: 12472 ') or '12472')
n_err_detect_seqs = int(input('How many negative control sequences? Default: 150 ') or '150')
min_perfect_target_copies = int(input('How many perfect targets do you desire? Default: 50 ') or '50')
bases = 'ACGT'

# Target features
targets_fpath = '/home/joneslab/nucleaseq/resources/targets.yml'#TnpB-targets.yml'
target_name = 'D'
target_pam = 'TTTA'  # TODO: when prompting for the PAM, only allow A, C, G, and T.
pam_end = '5'  # TODO: when prompting for the PAM end, only allow 5' or 3' and state that
# the PAM must be given 5'->3'. It will not be flipped when it is on the 3' end.
interesting_pams = ['TTTA']  # PAMs to investigate if you're interested in them. Originally: 'NNGAN'
abs_cannonical_cut_sites = [13, 22]  # How many bases away from the PAM the sequence is cut

#Barcode features
barcodes_fpath = '/home/joneslab/nucleaseq/freebarcodes/barcodes/barcodes18-2.txt'#barcodes17-2.txt'#_subset1of3.txt'
with open(barcodes_fpath) as f:
    barcodes = [line.strip() for line in f]
bad_substrs = [target_pam, 'ATCAA'] # Forbidden subseqs in buffers or primers (here TnpB TAMs)

# Primer features
min_primer_len = 18  # First length to try. Will go smaller if possible. Adjust this if notebook too slow.
max_primer_len = 25

#Computational limits
nprocs = 2  # Number of processes to use for computation (originally 20)

log = logging.getLogger()
log.addHandler(logging.StreamHandler())
log.setLevel(logging.INFO)
# Note: the f-strings in this notebook require Python 3.6 or later.
log.info(f'''The provided barcodes file has {len(barcodes)} barcodes,
which could create {len(barcodes) // 2} barcode pairs. It is important that the
number of barcode pairs is not less than total_desired_seqs ({total_desired_seqs}).''')

In [4]:
# Load the file containing the targets used for library generation.
with open(targets_fpath) as f:
    targets = yaml.safe_load(f)
target = targets[target_name]
target_no_pam = target[len(target_pam):]
target_seeds = [target[:12]]

fudge_factor = 5             # How far from a canonical cut site are possible cuts
min_buffer_len = 5           # Min length of target-flanking buffers

print(f'Consider cut sites within {fudge_factor} bp of pamtarg positions: {abs_cannonical_cut_sites}')
print(f'Target {target_name} ({len(target)} bp): {target}')
print('Target seeds:')
for target_seed in target_seeds:
    print(f'    ({len(target_seed)} bp): {target_seed}')

print('Canonical cut sites:')
for ccs in abs_cannonical_cut_sites:
    ccs = int(math.ceil(ccs))  # e.g. 18.5 -> 19: cut between indices 18 and 19
    print(f'    [{target_pam}]{target_no_pam[:ccs]} x {target_no_pam[ccs:]}')
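How `fudge_factor` is applied downstream is not shown in this part of the notebook. As one plausible reading of "cut sites within `fudge_factor` bp of the canonical positions", the sketch below collects every position inside that window, clipped to the target length. The function `candidate_cut_sites` is hypothetical, and it assumes integer cut sites like the `[13, 22]` used in the run parameters.

```python
def candidate_cut_sites(canonical_sites, fudge_factor, target_len):
    """Return sorted positions within fudge_factor bp of any canonical site."""
    candidates = set()
    for site in canonical_sites:
        lo = max(0, site - fudge_factor)
        hi = min(target_len, site + fudge_factor)
        candidates.update(range(lo, hi + 1))
    return sorted(candidates)


# Assumed example values: two canonical sites on a 24 bp PAM-less target
sites = candidate_cut_sites([13, 22], fudge_factor=5, target_len=24)
print(sites)
```

Because the two windows overlap, using a set avoids counting the shared positions twice.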

In [5]:
def cut_sites(spacer, target_pam, pam_end='5', cut_sites=(13, 22)):
    """Return strings showing where each cut in the sequence will be made."""
    outcome = []
    pam_end = str(pam_end)
    if pam_end == '5':
        for ccs in cut_sites:
            outcome.append(f'[{target_pam}]{spacer[:ccs]} X {spacer[ccs:]}')
    elif pam_end == '3':
        # TODO: this branch is still not finished.
        for ccs in cut_sites:
            outcome.append(f'{spacer[:-ccs]} X {spacer[-ccs:]}[{target_pam}]')
        print('Note: PAM is not flipped')
    return outcome