# CRISPEY Oligo Library Design - adapted for humanized yeast pilot
Oligos were generated from two sets of variant files and combined into the final list for synthesis:
- Fritz Roth curated variants list, taken directly from the hYeast ORF FASTA (library name: humanized_validated)
- Shi-An's compiled variants from GnomAD (library name: humanized)

Oligos from both sets were combined, with priority given to the Fritz Roth set. Barcodes were assigned together with the rest of the CRISPEY3 library (see the separate Jupyter notebook for the CRISPEY3 library design).
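
The priority rule can be sketched as a concatenate-then-deduplicate step. This is a minimal illustration, not the code used for the library: the `combine_with_priority` helper, the `var_id` deduplication key, the `library` column, and the toy values are all assumptions made for the example.

```python
import pandas as pd

def combine_with_priority(priority_df, other_df, key="var_id"):
    """Concatenate two tables, keeping the priority row when a key occurs in both."""
    combined = pd.concat([priority_df, other_df], ignore_index=True)
    # keep="first" retains the row from priority_df because it was concatenated first
    return combined.drop_duplicates(subset=key, keep="first").reset_index(drop=True)

# Toy data; column names and values are illustrative only.
validated = pd.DataFrame({"var_id": ["v1", "v2"],
                          "library": ["humanized_validated"] * 2})
gnomad = pd.DataFrame({"var_id": ["v2", "v3"],
                       "library": ["humanized"] * 2})

merged = combine_with_priority(validated, gnomad)
print(merged)
```

Here `v2` appears in both toy tables, and the row from the priority table is the one that survives.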

## Import packages and functions

In [21]:
import os, random
import pandas as pd

crispey_libdesign_code_dir = os.path.expanduser('~/crispey-epistasis/lib_design/')
working_dir = os.path.expanduser("~/crispey3/humanized/")

os.chdir(crispey_libdesign_code_dir)
from extract_guides_functions import extract_guides_for_snps, \
                                     design_donor_for_SNP_guides, \
                                     rank_and_filter_SNP_guides, \
                                     generate_oligo_from_guide_donor_barcode, \
                                     write_output_oligos, \
                                     annotate_variants_by_VEPoutput

os.chdir(working_dir)
print("Current directory: {}".format(os.getcwd()))


Current directory: /home/users/rang/crispey3/humanized


## Design parameters

In [22]:
# lib_name = "humanized"
lib_name = "humanized_validated"

#####################################################
# guide design for SNPs
#####################################################

guide_length = 20
edit_max_distance_from_PAM5prime = 9
PAM_seq = 'GG' 
min_ok_Azimuth_score_SNP_guides = 0
off_targets_min_mismatch_SNP_guides = 1 # increase this to filter guides that have off-targets with mismatches. 1 only filters out guides with perfect match off-targets

# BOWTIE_exe = "bowtie2"

#####################################################
# donor design for SNPs
#####################################################
agilent_homopolymer_max_len = 10

excluded_seqs = ['A' * agilent_homopolymer_max_len, 
                 'C' * agilent_homopolymer_max_len, 
                 'G' * agilent_homopolymer_max_len, 
                 'T' * agilent_homopolymer_max_len,
                 'GCATGC', # SphI cut site
                 'GGCGCGCC', # AscI cut site
                 'GCGGCCGC'] # NotI cut site

donor_length = 108
min_dist_5prime_arm = 30
min_dist_3prime_arm = 55

#####################################################
# barcode grouping parameters
#####################################################
barcodes_per_group = 118
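
To illustrate what the `excluded_seqs` list screens for, the sketch below checks a candidate sequence for any excluded motif by plain substring matching. The `find_excluded_seqs` helper is hypothetical, and how `design_donor_for_SNP_guides` actually applies the list may differ.

```python
def find_excluded_seqs(seq, excluded_seqs):
    """Return the excluded motifs that occur in seq (case-insensitive)."""
    seq = seq.upper()
    return [motif for motif in excluded_seqs if motif in seq]

homopolymer_max_len = 10
excluded = [base * homopolymer_max_len for base in "ACGT"] + [
    "GCATGC",    # SphI cut site
    "GGCGCGCC",  # AscI cut site
    "GCGGCCGC",  # NotI cut site
]

clean = "ACGTTGCA" * 5                           # toy sequence without any motif
with_sphi = "ACGTACGT" + "GCATGC" + "TTAACCGG"   # toy sequence carrying an SphI site

print(find_excluded_seqs(clean, excluded))
print(find_excluded_seqs(with_sphi, excluded))
```

The three restriction sites are palindromic and the homopolymer entries cover both members of each complementary pair, so checking one strand is enough for this particular list. If the list is applied by substring matching as assumed here, any run of `agilent_homopolymer_max_len` or more identical bases is rejected, so the longest run that passes is one base shorter than that value.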


## Input file names

In [23]:
#####################################################
# Input files
#####################################################
input_dir= working_dir + "Input/"

# input VCF
# input_snps_vcf_filename = input_dir+'humanized_yeast_variants.vcf'
input_snps_vcf_filename = input_dir+'humanized_yeast_variants_validated.vcf'

# pool assignments
input_vars_pool_assignment_filename = input_dir+'crispey3_vars_pool_assignment.txt'

# oligo design table - contains information about the other segments in the oligo
input_oligo_design_table_filename = input_dir+'crispey3_oligo_design_table.txt'

# SNP guides donor design table 
# contains set names and filtering to sort variants/guides for each set. (use filter_in and filter_out columns)
input_SNP_donor_design_table_filename = input_dir+'humanized_design_donor_for_snps.txt'

# programmed barcodes list
input_barcode_table_filename = input_dir+'12BP_PBCs_well_grouped.csv'

# #####################################################
# # Human genome reference files
# #####################################################
# input_genome_fasta_filename = os.path.expanduser("~/scratch/hg19/hg19.fa")
input_genome_fasta_filename = input_dir+'hyeast_orf.fa'

## Output file names

These are the file names used and created during the pipeline.
Names may vary between designs.

In [24]:
###############################################################################################################
# Output files (no need to pass as argument -  depends on the output directory, PAM sequence and max edit distance)
###############################################################################################################
output_directory = working_dir + "Output/"
output_files_uniq_str = lib_name + "_" + PAM_seq + "_" + str(edit_max_distance_from_PAM5prime) + "bp"

# intermediate files during guide-donor-oligo design process
output_SNP_table_filename =    output_directory + "all_SNPs_" + output_files_uniq_str + "_SNP.tab"
output_guides_table_filename = output_directory + "all_SNPs_" + output_files_uniq_str + "_GUIDE.tab"
output_guides_with_features_table_filename = output_directory + "all_SNPs_" + output_files_uniq_str + "_GUIDE_withFeatures.tab"
output_SNP_donor_table_filename = output_directory + "all_SNPs_" + output_files_uniq_str + "_DONOR.tab"
output_guides_with_features_and_rank_table_filename = output_directory + "all_SNPs_" + output_files_uniq_str + "_GUIDE_withFeatures_withRank.tab"
output_oligos_table_filename = output_directory + "all_SNPs_" + output_files_uniq_str + "_OLIGO.tab"

# oligos to one table
output_oligo_for_production_nonuniq_filename = output_directory + "crispey_oligos_nonuniq_" + output_files_uniq_str + "_OLIGO.txt"
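# NOTE: this path is identical to output_oligo_for_production_nonuniq_filename above; writing both would overwrite the same file
output_oligo_for_production_nonuniq_with_align_filename = output_directory + "crispey_oligos_nonuniq_" + output_files_uniq_str + "_OLIGO.txt"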

output_oligo_for_production_uniq_filename = output_directory + "crispey_oligos_uniq_" + output_files_uniq_str + "_OLIGO.txt"
output_oligo_for_production_uniq_batch_prefix_filename = output_directory + "crispey_oligos_uniq_" + output_files_uniq_str + "_"

# # SNP table with VEP annotations
# output_SNP_withAnnotations_table_filename = output_directory + "all_SNPs_" + lib_name + "_vep_annotation.txt"


# Step 1: Design and extract all guides

In [25]:
# setting the random seed
random.seed(1)


In [26]:
extract_guides_for_snps(input_snps_vcf_filename, input_genome_fasta_filename, 
                        output_SNP_table_filename, output_guides_table_filename,
                        [PAM_seq], guide_length, edit_max_distance_from_PAM5prime,
                        var_id_prefix = "")


---------------------- extracting guides for SNPs -------------------------------
Finish parsing VCF: 130, found: 109, #guides: 265

---------------------------- Done extracting guides for SNPs --------------------------


In [27]:
# optional: filter guides that hit edge of sequences
guides_df = pd.read_csv(output_guides_table_filename, sep='\t')
# remove rows without complete guides
guides_df = guides_df.loc[~guides_df['guide_noPAM'].isna()]
guides_df = guides_df.loc[guides_df['guide_PAM_m4p3bp'].apply(len)==30,:]
# write back to file
guides_df.to_csv(output_guides_table_filename, sep='\t', index=False)
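
The effect of the two filters can be seen on a toy table. The sketch below uses the same column names as the cell above, but the rows are invented, and it uses `.str.len()` in place of `.apply(len)` as an equivalent alternative for these data.

```python
import pandas as pd

# Toy guides table; sequences are placeholders, not real guides.
toy = pd.DataFrame({
    "guide_id": ["g1", "g2", "g3"],
    "guide_noPAM": ["A" * 20, None, "C" * 20],               # g2 has no guide sequence
    "guide_PAM_m4p3bp": ["A" * 30, "G" * 30, "C" * 25],      # g3 is cut short at a sequence edge
})

has_guide = toy["guide_noPAM"].notna()
full_context = toy["guide_PAM_m4p3bp"].str.len() == 30

kept = toy.loc[has_guide & full_context]
print(kept["guide_id"].tolist())
```

Only `g1` passes both conditions. One design difference: `.str.len()` returns NaN for missing values, whereas `.apply(len)` raises a `TypeError` if the column itself contains NaN.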


# Step 2: Add guides features using Azimuth

Azimuth is written in Python 2, so the script that extracts guide features must be run in a Python 2 environment with Azimuth and Bowtie2 installed (for an example environment, see the crispr2_7 requirement file).

In this environment run **crispey_add_guide_features.py** with:
- output_guides_table_filename
- output_guides_with_features_table_filename
- input_genome_fasta_filename (reference genome)
- list of genome fasta filenames for off-target search

Adjust the script's variables to CRISPEY library parameters before running!
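
Since this step runs outside the notebook, it can help to confirm that its output exists before moving on to Step 3. The check below is a minimal sketch: `missing_files` is a hypothetical helper, and the file names are placeholders created in a temporary directory rather than the real pipeline outputs.

```python
import os
import tempfile

def missing_files(paths):
    """Return the paths that do not exist as regular files."""
    return [path for path in paths if not os.path.isfile(path)]

with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "guides_withFeatures.tab")
    absent = os.path.join(tmp, "not_written.tab")
    with open(present, "w") as handle:
        handle.write("guide_id\n")
    result = missing_files([present, absent])
    print([os.path.basename(path) for path in result])
```

In the notebook, the same call could be made on `[output_guides_with_features_table_filename]` before running the donor design cell.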

# Step 3: Design donors for guides

In [28]:
###############################################################################################################
# design donor sequence for each guide (set names specified in donor_design_table) 
###############################################################################################################
out_SNP_donor_df = design_donor_for_SNP_guides(
    input_SNP_donor_design_table_filename, 
    output_SNP_table_filename, 
    output_guides_with_features_table_filename,
    input_genome_fasta_filename,
    donor_length, excluded_seqs, min_dist_5prime_arm, min_dist_3prime_arm,
    output_SNP_donor_table_filename)


Excluded seq: AAAAAAAAAA
Excluded seq: CCCCCCCCCC
Excluded seq: GGGGGGGGGG
Excluded seq: TTTTTTTTTT
Excluded seq: GCATGC
Excluded seq: GGCGCGCC
Excluded seq: GCGGCCGC
--------- designing donors according to line: 0  (# guide ids = 4)----------
set_name                 DHFR
filter_in            VAR:DHFR
filter_out               None
donor_mut_type        REF2ALT
donor_seq_offsets        [14]
Name: 0, dtype: object
-----------------------------------------------------------
--------- designing donors according to line: 1  (# guide ids = 30)----------
set_name                 DPAGT1
filter_in            VAR:DPAGT1
filter_out                 None
donor_mut_type          REF2ALT
donor_seq_offsets          [14]
Name: 1, dtype: object
-----------------------------------------------------------
--------- designing donors according to line: 2  (# guide ids = 24)----------
set_name                 NSDHL
filter_in            VAR:NSDHL
filter_out                None
donor_mut_type         REF2ALT


# Step 4: Filter and rank guides

In [29]:
###############################################################################################################
# add filter and ranking to the SNP guides (depends on having a SNP and a donor tables) 
###############################################################################################################
out_SNP_guides_withFandR_df = rank_and_filter_SNP_guides(
        input_guides_with_features_table_filename = output_guides_with_features_table_filename,
        input_SNP_table_filename = output_SNP_table_filename,
        input_donor_table_filename = output_SNP_donor_table_filename,
        output_guides_with_features_and_rank_table_filename = output_guides_with_features_and_rank_table_filename,
        off_targets_min_mismatch_SNP_guides = off_targets_min_mismatch_SNP_guides, 
        min_ok_Azimuth_score_SNP_guides = min_ok_Azimuth_score_SNP_guides, 
        edit_max_distance_from_PAM5prime = edit_max_distance_from_PAM5prime)

Saving : /home/users/rang/crispey3/humanized/Output/all_SNPs_humanized_validated_GG_9bp_GUIDE_withFeatures_withRank.tab


# Step 5: Write guides and donors into oligos
Assemble each guide and donor into an oligo according to oligo_design_table. For now, the barcode segment of each oligo is filled with N's; the barcode sequence and pool number are assigned in the next step.

Note: Barcode assignment is done collectively with the other CRISPEY3 oligos. See the crispey3_design_library Jupyter notebook.

Pool numbers are assigned by variant ID, as specified in the vars_pool_assignment table. The pool assignments were arranged separately. For humanized yeast oligos, each oligo is assigned to one of the following pools based on its target gene:
- DHFR: 185
- DPAGT1: 186
- NSDHL: 187
- PGK1: 188
- PKLR: 189
- UROS: 190
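
The gene-to-pool mapping above can be written as a lookup table. The sketch below is illustrative only: it assumes a hypothetical variant ID layout of `<gene>_<suffix>`, and the `assign_pools` helper and the `var_id` and `pool` column names are invented here. The real format of crispey3_vars_pool_assignment.txt is not shown in this notebook.

```python
import pandas as pd

GENE_TO_POOL = {
    "DHFR": 185,
    "DPAGT1": 186,
    "NSDHL": 187,
    "PGK1": 188,
    "PKLR": 189,
    "UROS": 190,
}

def assign_pools(var_ids, gene_to_pool=GENE_TO_POOL):
    """Map each variant ID to a pool number via its gene prefix (assumed layout)."""
    rows = []
    for var_id in var_ids:
        gene = var_id.split("_", 1)[0]  # assumption: gene name precedes the first underscore
        if gene not in gene_to_pool:
            raise KeyError("No pool defined for gene '{}' (var_id {})".format(gene, var_id))
        rows.append({"var_id": var_id, "pool": gene_to_pool[gene]})
    return pd.DataFrame(rows, columns=["var_id", "pool"])

# Made-up variant IDs for demonstration.
pools = assign_pools(["DHFR_example1", "UROS_example2"])
print(pools)
```

Raising on an unknown gene, rather than silently skipping it, keeps an oligo from ending up without a pool.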

In [30]:
oligo_all_df = generate_oligo_from_guide_donor_barcode(
    input_oligo_design_table_filename = input_oligo_design_table_filename,
    input_guide_table_filename = output_guides_with_features_and_rank_table_filename, 
    input_donor_table_filename = output_SNP_donor_table_filename,
    input_barcode_table_filename = None, # use None to fill N's in barcode segment
    input_guide_iloc = None, # option to filter out guides
    input_donor_iloc = None, # option to filter out donors
    group_size = barcodes_per_group,                   
    output_oligos_table_filename = output_oligos_table_filename)

Before filtering there are 256 guides and 243 donors
After filtering there are 256 guides and 243 donors
shared columns
['var_id', 'guide_id']
joining the guides and the donors by shared columns (guide_id) creates 243 oligos
No barcodes provided. Skipping barcode assignment.
parsing oligo 0 out of 243
Saving : /home/users/rang/crispey3/humanized/Output/all_SNPs_humanized_validated_GG_9bp_OLIGO.tab


# Step 6: Append humanized yeast oligos to rest of CRISPEY3 library for barcode assignment and export