# Probe design pipeline

### Code to design probes for star-trex

Import the required packages.

In [1]:
import os
import pandas as pd
import sys
sys.path.insert(1, os.path.abspath('..'))

Define the working directory here.

In [2]:
work_dir = "/Users/leonievb/Library/CloudStorage/OneDrive-Personal/Postdoc/Data/Probe_data/"

First, enter the information that controls how the probes are designed. Modify the paths as needed.

In [19]:
#Path to a CSV file with the list of genes to be used for probe design.
#Expected format, with no header row:
# gene1
# gene2
genes_path = os.path.join(work_dir, "probe_gene_list.csv")

#Path to a CSV file that contains the database of calculated target sequences within the transcriptome.
#The full format is not fixed, but the file must have a header row, one column named "primer" for the
#primer target sequences, and one column named "padlock" for the padlock target sequences:
#        ...     ,primer,   ...   ,padlock, ...
#                ,pr_seq1,        ,pa_seq1,
#                ,pr_seq2,        ,pa_seq2,
probedb_path = os.path.join(work_dir, "M_musculus_filtered_probe.csv")

#Path to desired location and name of the output file
output_path = os.path.join(work_dir, "probes_all.csv")

#Path to a CSV file that provides one geneID for each gene. Either these geneIDs
#are used to create the probes, OR, if create_geneids = True, they are avoided because
#they are treated as already in use. If None is provided and create_geneids is not a number,
#an error is thrown. The full format is not fixed, but the file must have a header row,
#one column named "gene" for the gene symbol, and one column named "geneid" for the geneID:
# ...     ,gene,   ...   ,geneID, ...
#         ,gene1,        ,geneID1,
#         ,gene2,        ,geneID2,
geneids_path = os.path.join(work_dir, "probes_cloneIDs.csv")
create_geneids = 5

#Maximum number of probes to design per gene
probe_max = 4

#Exclude "TA" sequences in the spacer region between the primer and padlock targets
exclude_TA = True
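
Before running the designer, it can help to confirm that an input file really has the columns described in the comments above. The following is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper and not part of `probe_designer`.

```python
import csv

def missing_columns(csv_path, required):
    """Return the required column names that are absent from the CSV header row."""
    with open(csv_path, newline="") as handle:
        # Only the first row (the header) is read; an empty file yields an empty header.
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]
```

For example, `missing_columns(probedb_path, ["primer", "padlock"])` should return an empty list for a database file that matches the description above.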

In [20]:
from importlib import reload
from src import probe_designer

#Reload the module so that edits to src/probe_designer.py are picked up without restarting the kernel
reload(probe_designer)
from src.probe_designer import probe_designer

df = probe_designer(genes_path=genes_path, probedb_path=probedb_path, output_path=output_path,
                    geneids_path=geneids_path, create_geneids=create_geneids, probe_max=probe_max,
                    exclude_TA=exclude_TA)

Gene Grfa2 is not in at least one of the databases and was skipped
Gene RP23-231J2.1 is not in at least one of the databases and was skipped


If there are genes for which no probes could be designed, their names are listed above. Please check why they could not be included in the probe design and correct the input if needed.
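
One way to follow up on skipped genes is to compare the requested gene list with the `gene` column of the returned table, and to look for near-identical symbols that may indicate a typo. The sketch below uses `difflib` from the standard library; `find_skipped_genes` and `suggest_symbols` are hypothetical helpers, not functions of this pipeline.

```python
from difflib import get_close_matches

def find_skipped_genes(requested, designed):
    """Return requested genes that have no probe in the output, in their original order."""
    designed_set = set(designed)
    return [gene for gene in requested if gene not in designed_set]

def suggest_symbols(gene, known_symbols, n=3):
    """Suggest up to n known symbols that closely resemble a skipped gene name."""
    return get_close_matches(gene, list(known_symbols), n=n)
```

With the variables from this notebook, `find_skipped_genes(pd.read_csv(genes_path, header=None)[0], df["gene"])` would list the genes without probes. `suggest_symbols` needs a list of valid symbols, which is assumed to come from whichever database you check against.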

The results are stored as a .csv file at the output location you specified, but you can also inspect them here.

In [10]:
df

Unnamed: 0,gene,geneID,padlockID,padlock_seq,primerID,primer_seq
0,Cpne4,CGGAA,Cpne4_00,AAAATACTGTTGAGTCGCGTCATCGTAATTATTACCGGAACATACA...,Cpne4_10,ATCACAACCTCTGTTCGATGCACATATTTTTATCTT
1,Cpne4,CGGAA,Cpne4_01,AAAATAATTCCGTCGTCACCGTCCAAATTATTACCGGAACATACAC...,Cpne4_11,TCTCCCTTGGGTGACCTTAGTATTTTTATCTT
2,Cpne4,CGGAA,Cpne4_02,AAAATAGTACCTGTTTCCCTTCCATGAATTATTACCGGAACATACA...,Cpne4_12,GGGGTTGATGCATTCCCACTATTTTTATCTT
3,Cpne4,CGGAA,Cpne4_03,AAAATAGAACGGAAAGTTGGACAGCCAATTATTACCGGAACATACA...,Cpne4_13,AGCTCAAAGACCAAGCGATTTATTTTTATCTT
4,Fezf2,TACCC,Fezf2_00,AGTCTATAGTGTTTTAGAAGTGGCCGAATTATTACTACCCCATACA...,Fezf2_10,ATGCGCTCGATAGAGAAAGTAGACTTATCTT
...,...,...,...,...,...,...
1827,Lbhd2,ACAGA,Lbhd2_03,ACGTTACAGAGCCAAGGGCCCTTCTAATTATTACACAGACATACAC...,Lbhd2_13,ACAATAGAGGGCAGTCGCTGTAACGTTATCTT
1828,Gm17750,TGTCT,Gm17750_00,AACCTAATCACTCAGTGCTACATGGCAATTATTACTGTCTCATACA...,Gm17750_10,AGACCTTGTCTAGAATTGGCATGTAGGTTTATCTT
1829,Gm17750,TGTCT,Gm17750_01,AACCTAAATCTTCACCCAGGATGGTGTAATTATTACTGTCTCATAC...,Gm17750_11,AGCACATCCACATTCAATTGCAATAGGTTTATCTT
1830,Gm17750,TGTCT,Gm17750_02,AACCTATCCATATCCAGGAGCACAGAATTATTACTGTCTCATACAC...,Gm17750_12,AGCTCTTGAGGAGAGATTAACATAGGTTTATCTT
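
As a quick sanity check on the table above, the number of probes per gene can be counted and compared against `probe_max`. This is a minimal sketch that works on any iterable of gene names; `genes_over_limit` is a hypothetical helper, not part of the pipeline.

```python
from collections import Counter

def genes_over_limit(genes, probe_max):
    """Return {gene: row count} for genes that appear in more rows than probe_max."""
    counts = Counter(genes)
    return {gene: n for gene, n in counts.items() if n > probe_max}
```

Assuming each row of the table is one padlock/primer pair, `genes_over_limit(df["gene"], probe_max)` should return an empty dictionary if the limit was respected.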
