# Design CTP-15 high-resolution tracing of Cerebellin2

by Pu Zheng

2021.10.23

Design loci for Cbln2 and its known enhancers


<a id='top'></a>
# Table of Contents

> 0. [Minimum required packages and settings](#0)
>>
>> 0.1: [load required packages](#0.1)
>>
>> 0.2: [required parameters for library](#0.2)
>
>1. [Extract region sequences](#1)
>>
>>1.1 [load gene list](#1.1)
>>
>>1.2 [load gene TSS sequences](#1.2)
>
>2. [Design a sequential tracing encoding scheme](#2)
>>
>>2.1 [generate region_2_readout](#2.1)
>
>3. [Design targeting sequences](#3)
>>
>>3.1 [Construct count table with all the 17-mers in the genome](#3.1)
>>
>>3.2 [Design probes targeting sequences by pb_designer](#3.2)
>>
>>3.3 [Summarize into pb_dict](#3.3)
>
>4. [Assemble probes](#4)
>>
>>4.1 [Load region_2_readouts](#4.1)
>>
>>4.2 [Load primers and readouts](#4.2)
>>
>>4.3 [Assemble probes](#4.3)
>
>5. [Check probe quality](#5)
>>
>>5.1 [Basic quality_checks](#5.1)
>>
>>5.2 [Blast screening](#5.2)
>>
>>5.3 [Reload saved probes and check length](#5.3)


<a id='0'></a>
# 0. Minimum required packages and parameters


[back to top](#top)

<a id='0.1'></a>
## 0.1 load required packages

[back to top](#top)

In [3]:
%run "..\..\Startup_py3.py"
sys.path.append(r"..\..\..\..\Documents")

import ImageAnalysis3 as ia
%matplotlib notebook

from ImageAnalysis3 import *
print(os.getpid())

# library design specific tools
from ImageAnalysis3.library_tools import LibraryDesigner as ld
from ImageAnalysis3.library_tools import LibraryTools as lt
# biopython imports
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Blast.Applications import NcbiblastnCommandline
from Bio.Blast import NCBIXML

34976


<a id='0.2'></a>
## 0.2 required parameters for library

[back to top](#top)

In [54]:
## Some folders
# NEW mouse genome
genome_assembly = 'GRCm39'
reference_folder = os.path.join(r'\\10.245.74.212\Chromatin_NAS_2\Chromatin_Libraries\Genomes\mouse', f'{genome_assembly}_ensembl')
genome_folder = os.path.join(reference_folder, 'Genome')
# Library directories
pool_folder = r'\\10.245.74.212\Chromatin_NAS_2\Chromatin_Libraries\CTP-15_MeCP2_genes'

In [6]:
resolution = 1000
flanking = 5000

print(f"resolution: {resolution}, flanking size: {flanking}")
# folder for sub-pool
library_folder = os.path.join(pool_folder, f'Cbln2')
if not os.path.exists(library_folder):
    print(f"create library folder: {library_folder}")
    os.makedirs(library_folder)
# folder for fasta sequences
sequence_folder = os.path.join(library_folder, 'sequences')
if not os.path.exists(sequence_folder):
    print(f"create sequence folder: {sequence_folder}")
    os.makedirs(sequence_folder)
# folder to save result probes
report_folder = os.path.join(library_folder, 'reports')
if not os.path.exists(report_folder):
    print(f"create report folder: {report_folder}")
    os.makedirs(report_folder)
    
print(f"-- library_folder: {library_folder}")
print(f"-- sequence_folder: {sequence_folder}")
print(f"-- report_folder: {report_folder}")

resolution: 1000, flanking size: 5000
create library folder: \\10.245.74.212\Chromatin_NAS_2\Chromatin_Libraries\CTP-15_MeCP2_genes\Cbln2
create sequence folder: \\10.245.74.212\Chromatin_NAS_2\Chromatin_Libraries\CTP-15_MeCP2_genes\Cbln2\sequences
create report folder: \\10.245.74.212\Chromatin_NAS_2\Chromatin_Libraries\CTP-15_MeCP2_genes\Cbln2\reports
-- library_folder: \\10.245.74.212\Chromatin_NAS_2\Chromatin_Libraries\CTP-15_MeCP2_genes\Cbln2
-- sequence_folder: \\10.245.74.212\Chromatin_NAS_2\Chromatin_Libraries\CTP-15_MeCP2_genes\Cbln2\sequences
-- report_folder: \\10.245.74.212\Chromatin_NAS_2\Chromatin_Libraries\CTP-15_MeCP2_genes\Cbln2\reports


<a id='1'></a>
# 1 Extract region sequences

In CTP-12, I used code that extracts region_dict from the TSS of each gene, based on the .gff3 files in the Ensembl archive.

Here, I build region_dict directly from manually entered coordinates taken from this paper:

https://www.nature.com/articles/s41586-021-03952-y#Sec7

I BLASTed the primer sequences from Supplementary Table 2 of the paper to locate the regions.

[back to top](#top)

-283

# gene
Chromosome 18: 86,729,235-86,736,408 forward strand.

GRCm39:CM001011.3

In [50]:
region_dicts = [
    {'Gene':'Cbln2_E1','Name':'Cbln2_E1','Chr':18,'Start':86694187,'End':86694423,'Region':'18:86694187-86694423','Strand':'+'},
    {'Gene':'Cbln2_E2','Name':'Cbln2_E2','Chr':18,'Start':86742014,'End':86742297,'Region':'18:86742014-86742297','Strand':'+'},
    {'Gene':'Cbln2','Name':'Cbln2','Chr':18,'Start':86728002,'End':86736408,'Region':'18:86728002-86736408','Strand':'+'},
]
region_dicts_file = os.path.join(sequence_folder, 'region_dicts.pkl')
if os.path.exists(region_dicts_file):
    print(f"region_dicts_file:{region_dicts_file} already exists, skip.")
else:
    print(f"saving region_dicts to file: {region_dicts_file}")
    # use a context manager so the file handle is closed after writing
    with open(region_dicts_file, 'wb') as _handle:
        pickle.dump(region_dicts, _handle)

region_dicts_file:\\10.245.74.212\Chromatin_NAS_2\Chromatin_Libraries\CTP-15_MeCP2_genes\Cbln2\sequences\region_dicts.pkl already exists, skip.
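
The entries above repeat each coordinate twice, once in `Start`/`End` and once in the `Region` string, so a typo in one place can go unnoticed. A minimal sketch of a helper that derives `Region` from the coordinates is given below; `make_region_dict` is a hypothetical name and not part of ImageAnalysis3.

```python
def make_region_dict(name, chrom, start, end, strand='+'):
    """Build one region entry; the 'Region' string is derived from the coordinates."""
    if start >= end:
        raise ValueError(f"{name}: start ({start}) must be smaller than end ({end})")
    return {
        'Gene': name, 'Name': name, 'Chr': chrom,
        'Start': start, 'End': end,
        'Region': f"{chrom}:{start}-{end}", 'Strand': strand,
    }

# same coordinates as in the cell above
example_regions = [
    make_region_dict('Cbln2_E1', 18, 86694187, 86694423),
    make_region_dict('Cbln2_E2', 18, 86742014, 86742297),
    make_region_dict('Cbln2', 18, 86728002, 86736408),
]
for _r in example_regions:
    print(_r['Name'], _r['Region'], _r['End'] - _r['Start'], 'bp')
```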


In [33]:
reload(library_tools.references)
reload(library_tools.sequences)
# load all ref sequences
seq_rd = library_tools.sequences.sequence_reader(genome_folder, 
                                                 flanking=flanking, resolution=resolution,
                                                 auto_load_ref=True)
# Load sequences
for _dict in region_dicts:
    seq_rd.find_sequence_for_region(_dict, )
# save sequences
seq_rd.save_sequences(sequence_folder)

-- load sequence: 1, size=195154279
-- load sequence: 10, size=130530862
-- load sequence: 11, size=121973369
-- load sequence: 12, size=120092757
-- load sequence: 13, size=120883175
-- load sequence: 14, size=125139656
-- load sequence: 15, size=104073951
-- load sequence: 16, size=98008968
-- load sequence: 17, size=95294699
-- load sequence: 18, size=90720763
-- load sequence: 19, size=61420004
-- load sequence: 2, size=181755017
-- load sequence: 3, size=159745316
-- load sequence: 4, size=156860686
-- load sequence: 5, size=151758149
-- load sequence: 6, size=149588044
-- load sequence: 7, size=144995196
-- load sequence: 8, size=130127694
-- load sequence: 9, size=124359700
-- load sequence: MT, size=16299
-- load sequence: X, size=169476592
-- load sequence: Y, size=91455967
-- load sequence: JH584299.1, size=953012
-- load sequence: GL456233.2, size=559103
-- load sequence: JH584301.1, size=259875
-- load sequence: GL456211.1, size=241735
-- load sequence: GL456221.1, size=206
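
To get a feeling for how `flanking` and `resolution` turn three input regions into many small targets, here is a rough sketch. It assumes each region is extended by `flanking` on both sides and then cut into consecutive windows of `resolution` bp; `tile_region` is a hypothetical helper, and the actual splitting inside `sequence_reader` may differ.

```python
import math

def tile_region(start, end, flanking=5000, resolution=1000):
    """Split [start - flanking, end + flanking] into consecutive windows of `resolution` bp."""
    ext_start, ext_end = start - flanking, end + flanking
    n_windows = math.ceil((ext_end - ext_start) / resolution)
    windows = []
    for i in range(n_windows):
        w_start = ext_start + i * resolution
        w_end = min(w_start + resolution, ext_end)  # last window may be shorter
        windows.append((w_start, w_end))
    return windows

# coordinates copied from region_dicts
coords = {
    'Cbln2_E1': (86694187, 86694423),
    'Cbln2_E2': (86742014, 86742297),
    'Cbln2': (86728002, 86736408),
}
tiles = {name: tile_region(s, e) for name, (s, e) in coords.items()}
total_windows = sum(len(w) for w in tiles.values())
print({name: len(w) for name, w in tiles.items()}, total_windows)
```

Under this assumption the three regions give 11 + 11 + 19 = 41 windows, which agrees with the 41 fasta files picked up in section 3.2, although that agreement alone does not confirm how the reader tiles the sequence.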

<a id='2'></a>
# 2 Design a sequential tracing encoding scheme

For each gene (in sorted order), assign two identical copies of a readout that is unique to that gene.

[back to top](#top)

<a id='2.1'></a>
## 2.1 generate region_2_readout

[back to top](#top)
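
A minimal sketch of what such a mapping could look like is given here. The region and readout names are made up for illustration, and `assign_readouts` is a hypothetical helper rather than a function from ImageAnalysis3.

```python
def assign_readouts(region_names, readout_names, copies=2):
    """Map each region (in sorted order) to `copies` copies of one unique readout."""
    regions = sorted(region_names)
    if len(set(readout_names)) != len(readout_names):
        raise ValueError("readout names must be unique")
    if len(readout_names) < len(regions):
        raise ValueError("not enough readouts for all regions")
    return {reg: [rd] * copies for reg, rd in zip(regions, readout_names)}

# hypothetical region and readout names, for illustration only
toy_regions = ['Cbln2_reg_2', 'Cbln2_reg_0', 'Cbln2_reg_1']
toy_readouts = ['readout_a', 'readout_b', 'readout_c']
region_2_readout = assign_readouts(toy_regions, toy_readouts)
print(region_2_readout)
```

Note that `sorted` orders strings lexicographically, so with ten or more regions the names would need zero-padded indices (or a numeric sort key) to keep genomic order.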

<a id='3'></a>
# 3 Design targeting sequences

[back to top](#top)

<a id='3.1'></a>
## 3.1 Construct count table with all the 17-mers in the genome

Create new count tables if no pre-built 17-mer table is available.

This library uses the GRCm39 genome (see `genome_assembly` in section 0.2).

[back to top](#top)
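
For readers unfamiliar with count tables, the idea can be sketched on a toy scale: every k-mer is encoded as an integer with 2 bits per base, and a dense array indexed by that integer stores how often the k-mer occurs. The code below is an illustration only, with hypothetical helper names and a short word size; it is not the `countTable` implementation, whose internals (dtype, clipping, threading) are not shown in this notebook.

```python
import numpy as np

_BASE_TO_INT = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def kmer_to_int(kmer):
    """Encode a k-mer as a base-4 integer (2 bits per base)."""
    value = 0
    for base in kmer:
        value = value * 4 + _BASE_TO_INT[base]
    return value

def build_count_table(sequences, word=5, max_count=2**16 - 1):
    """Count every k-mer of length `word` in a dense array, clipped at `max_count`."""
    table = np.zeros(4 ** word, dtype=np.uint16)
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - word + 1):
            kmer = seq[i:i + word]
            if any(b not in _BASE_TO_INT for b in kmer):
                continue  # skip words containing N or other ambiguous bases
            idx = kmer_to_int(kmer)
            if table[idx] < max_count:  # clip instead of overflowing
                table[idx] += 1
    return table

toy_table = build_count_table(['ACGTACGTAC', 'TTTTTTT'], word=5)
print(toy_table[kmer_to_int('ACGTA')], toy_table[kmer_to_int('TTTTT')])
```

With `word=17` a dense table has 4**17 (about 1.7e10) entries, which is why the real table is built once, saved to disk, and reused across libraries.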

In [53]:
reload(library_tools.design)
overwrite_table = False
library_type = 'DNA'

### 3.1.1 construct map for whole genome

In [57]:
genome_table_file = os.path.join(reference_folder, f'{genome_assembly}_genome_17w.npy')

if not os.path.exists(genome_table_file) or overwrite_table:
    # genome
    _genome_filenames = [os.path.join(genome_folder, _fl) 
         for _fl in os.listdir(genome_folder) 
         if _fl.split(os.extsep)[-1]=='fasta' or _fl.split(os.extsep)[-1]=='fa']
    print(len(_genome_filenames))

    ct = library_tools.design.countTable(word=17,save_file=genome_table_file, 
                       sparse=False)
    ct.verbose=True
    
    ct.read(_genome_filenames) # read sequences from fasta files

    ct.consume_loaded(num_threads=32) # convert sequences into integers

    ct.complete(verbose=True)

    ct.save()

    # clear RAM once the count table has been constructed
    del ct

23
- Start multi-processing comsume 61 sequences 24 threads, finish in 494.486s
- Total sequences loaded: 2728221475
Time to compute unique and clip: 741.4039568901062
Time to update matrix: 28.07893443107605
- start saving to file: \\10.245.74.212\Chromatin_NAS_2\Chromatin_Libraries\Genomes\mouse\GRCm39_ensembl\GRCm39_genome_17w.npy


KeyboardInterrupt: 

In [58]:
    ct.save()

- start saving to file: \\10.245.74.212\Chromatin_NAS_2\Chromatin_Libraries\Genomes\mouse\GRCm39_ensembl\GRCm39_genome_17w.npy


### 3.1.2 construct map for transcriptome

skip for this library

### 3.1.3 construct map for repeats from RepBase

directly copy from previous library

### 3.1.4 construct map for previous library to avoid conflict

In [61]:
from tqdm import tqdm
# mouse genome
ctp11_folder = r'\\10.245.74.212\Chromatin_NAS_2\Chromatin_Libraries\CTP-11_brain\mouse_genome_1000'

ctp11_table_file = os.path.join(reference_folder, 'ctp11_genome_17w.npy')

if not os.path.exists(ctp11_table_file) or overwrite_table:
    # ctp11
    _ctp11_filenames = [os.path.join(ctp11_folder, _fl) 
         for _fl in os.listdir(ctp11_folder) 
         if _fl.split(os.extsep)[-1]=='fasta' or _fl.split(os.extsep)[-1]=='fa']
    # only keep the final_pool_probes
    _ctp11_filenames = [_fl for _fl in _ctp11_filenames if 'final_pool_probes' in _fl]
    
    ct = library_tools.design.countTable(word=17,save_file=ctp11_table_file, 
                       sparse=False)
    ct.verbose=True

    ct.read(_ctp11_filenames) # read sequences from fasta files

    ct.consume_loaded(num_threads=24) # convert sequences into integers

    ct.complete(verbose=True)

    ct.save()
    
    # clear RAM once the count table has been constructed
    del ct

- Start multi-processing comsume 173891 sequences 24 threads, finish in 100.184s
- Total sequences loaded: 21910266
Time to compute unique and clip: 5.280938148498535
Time to update matrix: 7.0782787799835205
- start saving to file: \\10.245.74.212\Chromatin_NAS_2\Chromatin_Libraries\Genomes\mouse\GRCm39_ensembl\ctp11_genome_17w.npy


In [65]:
test_ct = ld.countTable(sparse=False, save_file=ctp11_table_file)
test_ct.load()

In [67]:
%%timeit
test_ct.get('ATCGATCGATCGATCGA')

9.83 µs ± 24.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


### 3.1.5 construct map for isoforms for this library (only for RNA library)

In [68]:
from tqdm import tqdm
# isoform
# RNA specific: get isoform files
if library_type == 'RNA':  # compare string values with ==, not identity
    isoform_folder = os.path.join(sequence_folder, 'isoforms')

    isoform_table_file = os.path.join(isoform_folder, 'library_isoform_17w.npy')

    if not os.path.exists(isoform_table_file) or overwrite_table:
        # isoform
        isoform_files = [os.path.join(isoform_folder, _fl) 
                         for _fl in os.listdir(isoform_folder) 
                         if _fl.split(os.extsep)[-1] == 'fasta' or _fl.split(os.extsep)[-1] == 'fa']
        print(len(isoform_files))

        ct = library_tools.design.countTable(word=17,save_file=isoform_table_file, 
                           sparse=False)
        ct.verbose=True

        ct.read(isoform_files) # read sequences from fasta files

        ct.consume_loaded(num_threads=24) # convert sequences into integers

        ct.complete(verbose=True)

        ct.save()

        # clear RAM once the count table has been constructed
        del ct

<a id='3.2'></a>
## 3.2 Design probes targeting sequences by pb_designer

[back to top](#top)

In [69]:
# requires pre_defined genome_folder and library_folder
# Indices
genome_index = os.path.join(reference_folder, f'{genome_assembly}_genome_17w.npy')

repeat_index = os.path.join(reference_folder, 'Repbase_v2603_repeat_17w.npy')

ref_library_index = os.path.join(reference_folder, 'ctp11_genome_17w.npy')

if library_type == 'RNA':  # compare string values with ==, not identity
    isoform_folder = os.path.join(sequence_folder, 'isoforms')
    isoform_index = os.path.join(isoform_folder, 'library_isoform_17w.npy')

# get input files 
input_files = glob.glob(os.path.join(sequence_folder, '*.fasta'))

print(f"{len(input_files)} regions loaded to design probes.")

if not os.path.exists(report_folder):
    os.makedirs(report_folder)
    
# filename to save probe reports
save_file = os.path.join(report_folder, f'merged_probes.pbr')
print(f"target save filename: {save_file}")

41 regions loaded to design probes.
target save filename: \\10.245.74.212\Chromatin_NAS_2\Chromatin_Libraries\CTP-15_MeCP2_genes\Cbln2\reports\merged_probes.pbr


### 3.2.1 create pb_designer class

In [72]:
use_ref_library = True

sequence_dict = {
    'file':input_files,
}
map_dict = {
    'genome': {'file':genome_index,'rev_com':False,'two_stranded':True},
    #'transcriptome':{'file':transcriptome_index,'rev_com':True,'two_stranded':False},
    'rep_genome':{'file':repeat_index,'rev_com':False,'two_stranded':True},
}

params_dict = {
    'word_size':17,
    #'pb_len':40,
    'buffer_len':-2, # distance between probes
    'max_count':2**16-1,
    'check_on_go': False, # whether automatically check probes
    'auto': False, # whether automatically convert reference maps
    }

check_dict={
    'rep_genome': 0,
    'gc':[0.25,0.75],
    'masks': ['AAAAA','TTTTT','CCCCC','GGGGG', # homopolymer runs of 5
              'GAATTC','CTTAAG', # EcoRI sites
              'GGTACC','CCATGG',], # KpnI sites
           }

if library_type == 'DNA':
    sequence_dict['rev_com'] = False
    sequence_dict['two_stranded'] =True # Allow design on both DNA strands
    
    map_dict['self_sequences'] = {'file':input_files,'force_list':True,'rev_com':False,'two_stranded':True}
    
    params_dict['pb_len'] = 40
    
    check_dict['tm'] = 46+0.62*50-5 # 47C incubation + 50% formamide - 5C for the melting curve
    check_dict[('genome','self_sequences')] = int( (params_dict['pb_len'] - params_dict['word_size'] + 1) * 1.8 )
    #check_dict['transcriptome'] = int( (params_dict['pb_len'] - params_dict['word_size'] + 1) * 0.8 )
    
    
elif library_type == 'RNA':
    sequence_dict['rev_com'] = True
    sequence_dict['two_stranded'] =False
    
    map_dict['isoforms'] = {'file':isoform_index,'force_list':True,'rev_com':True,'two_stranded':False}
    
    params_dict['pb_len'] = 30
    
    check_dict['tm'] = 37+0.62*30-5
    check_dict[('transcriptome','isoforms')] = int( (params_dict['pb_len'] - params_dict['word_size'] + 1) * 0.8 )
    check_dict['genome'] = int( (params_dict['pb_len'] - params_dict['word_size'] + 1) * 1.8 )
    
# add ref library if applicable
if use_ref_library:
    
    map_dict['ref_library'] = {'file':ref_library_index,'rev_com':True,'two_stranded':True}
    
    check_dict['ref_library'] = int( (params_dict['pb_len'] - params_dict['word_size'] + 1) * 0.8 )
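
The numeric cutoffs in `check_dict` are easier to read once evaluated. The sketch below only redoes the arithmetic from the cell above for the DNA case; `kmer_cutoffs` is a hypothetical helper, and how `pb_reports_class` applies these values is not shown here.

```python
def kmer_cutoffs(pb_len, word_size, factors):
    """Number of k-mers per probe and the integer cutoffs derived from it."""
    n_words = pb_len - word_size + 1
    return n_words, {name: int(n_words * f) for name, f in factors.items()}

# DNA settings from the cell above: pb_len=40, word_size=17
n_words, cutoffs = kmer_cutoffs(
    pb_len=40, word_size=17,
    factors={'genome_minus_self': 1.8, 'ref_library': 0.8},
)
# same expression as check_dict['tm'] in the DNA branch
tm_cutoff = 46 + 0.62 * 50 - 5

print(f"17-mers per 40-nt probe: {n_words}")
print(f"cutoffs: {cutoffs}")
print(f"tm cutoff: {tm_cutoff:.1f}")
```

So a 40-nt probe contains 24 overlapping 17-mers, the ('genome','self_sequences') cutoff evaluates to 43, the ref_library cutoff to 19, and the Tm cutoff to 72.0.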

In [73]:
reload(library_tools)
reload(library_tools.design)

pb_designer = library_tools.design.pb_reports_class(
    sequence_dic = sequence_dict,
    map_dic = map_dict,
    params_dic = params_dict,
    check_dic = check_dict,
    save_file=save_file,
)
print(pb_designer)
pb_designer.load_from_file(load_probes_only=True)


Probe designer derived from Bogdan Bintu:
https://github.com/BogdanBintu/ChromatinImaging/blob/master/LibraryDesign/LibraryDesigner.py
by Pu Zheng, 2020.11

Major changes:
    1. allow design of two strands
    2. separate reverse_complement (rev_com) and from two strands (two_stranded) as 
    two different inputs for map_dic and sequence_dic
    3. replace 'local_genome' with 'self_sequences' to be more explicit, and only 
    exclude the counts for the corresponding self_sequence within each input. 

Key information:
    - number of input_sequence(s): 41
    - save_file location: \\10.245.74.212\Chromatin_NAS_2\Chromatin_Libraries\CTP-15_MeCP2_genes\Cbln2\reports\merged_probes.pbr

- Fail to load from savefile: \\10.245.74.212\Chromatin_NAS_2\Chromatin_Libraries\CTP-15_MeCP2_genes\Cbln2\reports\merged_probes.pbr, file doesn't exist.


False
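
The `gc` and `masks` entries of `check_dict` describe simple per-sequence filters. A minimal sketch of such filters is shown below, assuming inclusive GC bounds and plain substring matching for the masks; the function names are hypothetical and the real checks inside `pb_reports_class` may be implemented differently.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count('G') + seq.count('C')) / len(seq)

def passes_sequence_checks(
    seq,
    gc_range=(0.25, 0.75),
    masks=('AAAAA', 'TTTTT', 'CCCCC', 'GGGGG',  # homopolymer runs of 5
           'GAATTC', 'CTTAAG',                  # EcoRI sites
           'GGTACC', 'CCATGG'),                 # KpnI sites
):
    """Return True if the GC content is within range and no masked motif occurs."""
    seq = seq.upper()
    low, high = gc_range
    if not (low <= gc_fraction(seq) <= high):  # inclusive bounds are an assumption
        return False
    return not any(mask in seq for mask in masks)

# toy 40-nt candidates, for illustration only
good_probe = 'ACGT' * 10
ecori_probe = 'ACGTGAATTC' + 'ACGT' * 7 + 'AC'
low_gc_probe = 'AT' * 20
for _name, _seq in [('good', good_probe), ('EcoRI', ecori_probe), ('low GC', low_gc_probe)]:
    print(_name, round(gc_fraction(_seq), 2), passes_sequence_checks(_seq))
```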

In [None]:
###  calculate probe reports