
# Tandem Repeat analysis
#### Written by Jigar N. Bandaria

I developed this pipeline to find tandem repeats in the human genome that can be used for Cas9-based imaging. This series of IPython notebooks filters sequences using the criteria we applied for imaging in the manuscript:
"Live cell imaging of low- and non-repetitive chromosome loci using CRISPR/Cas9", Qin et al., Nature Comm.

Follow the notebooks in the following order:
1. sgRNA analysis 1.
2. sgRNA analysis 2.
3. sgRNA analysis 3.
4. Counting sgRNA.
5. Remove reverse complement.
6. Recheck after remove rev comp.

The scripts are written to analyze sequences within a 10 kb window. The initial analysis that generates the potential target sequences was done with a combination of Bowtie1, Samtools and Bedtools. The bash scripts are in the repository too; run them first to generate the sequences, then use these notebooks.


In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import subprocess as sp
import os

In [1]:
#The bash script generates a file containing all the sgRNA for each Chromosome. 
#Below I am running a test on Chromosome Y.
file_path = "/home/user/Desktop/nt_20/PAMs"

filename = file_path+"/PAM1_chrY.fa"
filename

'/home/user/Desktop/nt_20/PAMs/PAM1_chrY.fa'

In [4]:
#Below is the chromosome, position and sequence information for Chromosome Y
data = pd.read_table(filename,header=None,names=['chr_p','seq'])
data.head()

Unnamed: 0,chr_p,seq
0,chrY:10007-10024(-),ggttagggttagggtta
1,chrY:10008-10025(-),gggttagggttagggtt
2,chrY:10013-10030(-),ggttagggttagggtta
3,chrY:10014-10031(-),gggttagggttagggtt
4,chrY:10019-10036(-),ggttagggttagggtta


In [5]:
data['seq']=data.seq.str.upper() #convert everything to upper case
data.head(10)
len(data)

1943852

In [6]:
#Checking for duplicate sequences
match = data.duplicated('seq',keep=False)
data[match].head(10)

Unnamed: 0,chr_p,seq
0,chrY:10007-10024(-),GGTTAGGGTTAGGGTTA
1,chrY:10008-10025(-),GGGTTAGGGTTAGGGTT
2,chrY:10013-10030(-),GGTTAGGGTTAGGGTTA
3,chrY:10014-10031(-),GGGTTAGGGTTAGGGTT
4,chrY:10019-10036(-),GGTTAGGGTTAGGGTTA
5,chrY:10020-10037(-),GGGTTAGGGTTAGGGTT
13,chrY:10052-10069(-),ACCCACATCCTGCTGAT
18,chrY:10052-10069(+),ATCAGCAGGATGTGGGT
19,chrY:10102-10119(-),GTTGCCACTGCTGGCTC
31,chrY:10179-10196(+),TGCTATTGCTCACTCTT


In [7]:
data2=data[match]
len(data2) #Total number of rows whose sequence occurs more than once.

944202
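
The `keep=False` argument is what makes the count above include every copy of a repeated sequence, not just the second and later copies. A minimal sketch on a made-up four-row table (the `chrT` coordinates and the names `toy`, `all_copies` and `later_copies` are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Toy table in the same two-column layout as the sgRNA files (values are made up).
toy = pd.DataFrame({
    "chr_p": ["chrT:1-18(+)", "chrT:7-24(+)", "chrT:30-47(-)", "chrT:50-67(+)"],
    "seq": ["GGTTAGGGTTAGGGTTA", "GGTTAGGGTTAGGGTTA",
            "ACCCACATCCTGCTGAT", "GTTGCCACTGCTGGCTC"],
})

# keep=False flags every member of a duplicated group.
all_copies = toy.duplicated("seq", keep=False)
# The default (keep='first') leaves the first occurrence unflagged.
later_copies = toy.duplicated("seq")

dup = toy[all_copies]    # rows whose sequence occurs more than once
uni = toy[~all_copies]   # rows whose sequence occurs exactly once

print(all_copies.tolist())
print(later_copies.tolist())
print(len(dup), len(uni))
```

Because the mask and its negation partition the table, `len(dup) + len(uni)` always equals the number of input rows, which is a quick sanity check on the per-chromosome counts.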

Below is a function that checks the sgRNAs of each chromosome for duplicates. It then generates two files: one for the duplicate sequences and one for the unique sequences.

In [2]:
def dup_and_uniq(chr_num):
    file_path = "/home/user/Desktop/nt_17/PAMs/"
    filename = file_path+"PAM1_"+str(chr_num)
    print(filename)
    data = pd.read_table(filename,header=None,names=['chr_p','seq'])
    data['seq']=data.seq.str.upper()
    match = data.duplicated('seq',keep=False)
    
    out_dup = file_path+chr_num.split('.')[0]+'_dup.fa'
    out_uni = file_path+chr_num.split('.')[0]+'_uni.fa'
    
    dup = data[match]
    uni = data[~match]
    print ("#Duplicates, #Uniques")
    print (len(dup),len(uni))
    
    dup.to_csv(out_dup,header=None,index=None,sep='\t')
    uni.to_csv(out_uni,header=None,index=None,sep='\t')
    del data
    del uni
    del dup

In [3]:
chr_num = ['chrY.fa','chrX.fa','chr22.fa','chr21.fa','chr20.fa',
           'chr19.fa','chr18.fa','chr17.fa','chr16.fa','chr15.fa',
           'chr14.fa','chr13.fa','chr12.fa','chr11.fa','chr10.fa',
           'chr9.fa','chr8.fa','chr7.fa','chr6.fa','chr5.fa','chr4.fa',
           'chr3.fa','chr2.fa','chr1.fa']

for chromosome in chr_num:
    print ('==========================================================')
    print (chromosome)
    dup_and_uniq(chromosome)
    
#Below is the output. It also prints the number of duplicates and uniques on each chromosome.

chrY.fa
/home/jigar/Desktop/nt_17/PAMs/PAM1_chrY.fa
#Duplicates, #Uniques
944202 999650
chrX.fa
/home/jigar/Desktop/nt_17/PAMs/PAM1_chrX.fa
#Duplicates, #Uniques
3312811 7775232
chr22.fa
/home/jigar/Desktop/nt_17/PAMs/PAM1_chr22.fa
#Duplicates, #Uniques
945872 3203895
chr21.fa
/home/jigar/Desktop/nt_17/PAMs/PAM1_chr21.fa
#Duplicates, #Uniques
506168 2288724
chr20.fa
/home/jigar/Desktop/nt_17/PAMs/PAM1_chr20.fa
#Duplicates, #Uniques
1048196 4822575
chr19.fa
/home/jigar/Desktop/nt_17/PAMs/PAM1_chr19.fa
#Duplicates, #Uniques
2038978 4627140
chr18.fa
/home/jigar/Desktop/nt_17/PAMs/PAM1_chr18.fa
#Duplicates, #Uniques
1003615 4535144
chr17.fa
/home/jigar/Desktop/nt_17/PAMs/PAM1_chr17.fa
#Duplicates, #Uniques
2166427 6031564
chr16.fa
/home/jigar/Desktop/nt_17/PAMs/PAM1_chr16.fa
#Duplicates, #Uniques
2116986 5838681
chr15.fa
/home/jigar/Desktop/nt_17/PAMs/PAM1_chr15.fa
#Duplicates, #Uniques
1842016 5336697
chr14.fa
/home/jigar/Desktop/nt_17/PAMs/PAM1_chr14.fa
#Duplicates, #Uniques
1575479 5531

The function below takes the duplicate and unique files generated in the previous step and filters them further. The previous step analyzed only one chromosome at a time; the function below expands the search to all chromosomes at once. It removes from the duplicates file every sequence that also appears in the uniques file, so that only duplicate sequences are retained.

In [2]:
def remove_uniq_from_dup(dup,uni):
    file_path = "/home/user/Desktop/nt_17/dup_uni/"
    dup_file = file_path+dup+'_dup.fa'
    uni_file = file_path+uni+'_uni.fa'

    data_dup = pd.read_table(dup_file,header=None,names=['chr_p','seq'])
    data_uni = pd.read_table(uni_file,header=None,names=['chr_p','seq'])
    
    
    check =  data_uni.seq.values
    
    #print (check)
    mask = data_dup.seq.isin(check)
    
    #print (mask)
    print ('Sequences in Dup File before filter : {} '.format(len(data_dup)))
    print ('Sequences in Uni File before filter : {} '.format(len(data_uni)))
    
    print ('Sequences in Uni found in Dup : {}'.format(sum(mask)))
    new_dup = data_dup.loc[~mask]  # .ix was removed in pandas 1.0; .loc accepts the boolean mask
    print ('Sequences in new Dup after filter : {} '.format(len(new_dup)))
    
    
    out_dup = file_path+uni+'_dup.fa'
    new_dup.to_csv(out_dup,header=None,index=None,sep='\t')
    
    print ("Done writing to file {}\n\n".format(out_dup))
    
    del new_dup
    del data_uni
    del check
    del data_dup
    del mask
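
The core of `remove_uniq_from_dup` is a membership test with `Series.isin` followed by a boolean filter. A minimal in-memory sketch, assuming two small made-up tables in place of the `*_dup.fa` and `*_uni.fa` files (the names `pooled_dup`, `pooled_uni` and `filtered`, and the `chrA`/`chrB` coordinates, are hypothetical):

```python
import pandas as pd

# Made-up stand-ins for the pooled duplicate and unique files.
pooled_dup = pd.DataFrame({
    "chr_p": ["chrA:10-27(+)", "chrA:40-57(+)", "chrA:90-107(-)", "chrA:120-137(-)"],
    "seq": ["AAAACCCCGGGGTTTTA", "AAAACCCCGGGGTTTTA",
            "TTGAGCCCAGGAGCTTG", "TTGAGCCCAGGAGCTTG"],
})
pooled_uni = pd.DataFrame({
    "chr_p": ["chrB:5-22(+)", "chrB:60-77(-)"],
    "seq": ["TTGAGCCCAGGAGCTTG", "CACTCTGTCACCCAGGT"],
})

# True for every duplicate row whose sequence also occurs in the unique table.
mask = pooled_dup["seq"].isin(pooled_uni["seq"])

# Keep only the rows that were not found in the unique table.
filtered = pooled_dup.loc[~mask]

print(int(mask.sum()), len(filtered))
```

Here both `TTGAGCCCAGGAGCTTG` rows are dropped because that sequence also appears in the unique table, while the `AAAACCCCGGGGTTTTA` pair is kept. `mask.sum()` is used instead of the built-in `sum(mask)` because it stays fast on tables with tens of millions of rows.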

In [3]:
import pandas as pd
import numpy as np

#dup_list=['all', 'all1','all2','all3']
#uni_list=['all1', 'all2','all3','all4']
dup_list=['all7']
uni_list=['all8']
for dd,uu in zip(dup_list,uni_list):
    print (dd,uu)
    remove_uniq_from_dup(dd,uu)

all7 all8
Sequences in Dup File before filter : 39300537 
Sequences in Uni File before filter : 21786293 
Sequences in Uni found in Dup : 1369320
Sequences in new Dup after filter : 37931217 
Done writing to file /home/jigar/Desktop/nt_17/dup_uni/all8_dup.fa




In [3]:
file_path = "/home/user/Desktop/nt_17/dup_uni/"
dup_file = file_path+'all8_dup.fa'
data_dup1 = pd.read_table(dup_file,header=None,names=['chr_p','seq'])

In [7]:
match1 = data_dup1.duplicated('seq',keep=False)
print (len(data_dup1),len(match1))
#This checks that all remaining are duplicates
print (sum(match1))

37931217 37931217
37931217


In [8]:
#Analyzing the duplicates that remain after removing any that were also found in the uni file
#Saving the data for further analysis.

data_dup1 = pd.concat([data_dup1,data_dup1.chr_p.str.split(':',expand=True)],axis=1)
data_dup1.columns =['chr_p','Sequence','Chromosome','Position']
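
The column split above can be checked on a two-row table. This is a small sketch that reuses two rows from the output in this notebook; the variable names `toy` and `parts` are only illustrative:

```python
import pandas as pd

toy = pd.DataFrame({
    "chr_p": ["chr10:61294-61311(-)", "chr10:61297-61314(-)"],
    "seq": ["TACTTAGGAGGCTGAGG", "AGCTACTTAGGAGGCTG"],
})

# expand=True returns a DataFrame with one column per piece (labelled 0 and 1).
parts = toy["chr_p"].str.split(":", expand=True)

# Append the pieces and rename all four columns in order.
toy = pd.concat([toy, parts], axis=1)
toy.columns = ["chr_p", "Sequence", "Chromosome", "Position"]

print(toy[["Chromosome", "Position"]].values.tolist())
```

The rename relies on column order: the original `seq` column becomes `Sequence`, and the two pieces of the split become `Chromosome` and `Position`.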


In [3]:
data_dup1.head()


Unnamed: 0,chr_p,Sequence,Chromosome,Position
0,chr10:61294-61314(-),AGCTACTTAGGAGGCTGAGG,chr10,61294-61314(-)
1,chr10:61297-61317(-),TCCAGCTACTTAGGAGGCTG,chr10,61297-61317(-)
2,chr10:61293-61313(+),ACCTCAGCCTCCTAAGTAGC,chr10,61293-61313(+)
3,chr10:66163-66183(-),CTTCCTGATTTAAGCTAGGA,chr10,66163-66183(-)
4,chr10:66164-66184(-),TCTTCCTGATTTAAGCTAGG,chr10,66164-66184(-)


In [9]:

dup_file_1 = file_path+'all_dup_ex.fa'
dup_file_1
data_dup1.to_csv(dup_file_1,sep='\t')

In [10]:
del data_dup1['chr_p']
dup_file_2 = file_path+'all_dup_ex_nochrp.fa'
data_dup1.to_csv(dup_file_2,header=None,index=None,sep='\t')

In [11]:
data_dup1.head()

Unnamed: 0,Sequence,Chromosome,Position
0,GCCTCACTCTGTCACCC,chr10,61207-61224(+)
1,CACTCTGTCACCCAGGT,chr10,61211-61228(+)
2,TTGAGCCCAGGAGCTTG,chr10,61265-61282(-)
3,TACTTAGGAGGCTGAGG,chr10,61294-61311(-)
4,AGCTACTTAGGAGGCTG,chr10,61297-61314(-)
