# Tandem Repeat analysis
#### Written by Jigar N. Bandaria
In this notebook I further analyze the sequences that were saved in the previous notebook, 'sgRNA analysis 3'. The sequences that can form hotspots are examined, and only those that are repeated 4 or more times within a 10 kbp window are kept; the remaining sequences are removed.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import os

Let's first do some preliminary analysis on chr9 and then we can apply the analysis to all chromosomes.

In [2]:
file_path = "/home/user/Desktop/nt_17/chrom_only/"
filename = file_path+"chr9"+"_4more.fa"
tmp1 = pd.read_csv(filename,header=None,sep='\t',names=['Sequence','Chromosome','Position'])
x = "chr9"   
print("{0} : {1} : {2}".format(x,len(tmp1),len(tmp1.Sequence.value_counts())))

#This is exactly what we had obtained in the previous notebook.

chr9 : 271391 : 51593


Right now the Position column contains the start position, stop position and strand information all together, e.g. 123165-123182(+). I will use a regex to separate them into 3 columns.

In [3]:
import re
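# Split on '-', '(' or ')' only when preceded by a digit or a sign, so that a '-' strand symbol is kept.
j1 = np.asarray([re.split(r'(?<=[0-9-+])[-()]', pos)[:3] for pos in tmp1.Position.tolist()])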

tmp2 = pd.DataFrame(j1,columns=['Start','Stop','Strand'])

tmp2[['Start','Stop']]=tmp2[['Start','Stop']].astype(np.int32)

data=pd.concat([tmp1,tmp2],axis=1)
del data['Position']
data.dtypes

Sequence      object
Chromosome    object
Start          int32
Stop           int32
Strand        object
dtype: object
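
As an alternative to the split above, the same three fields could be pulled out with named positions in a single pattern. This is only a minimal sketch on two toy Position strings (the `positions` Series and the pattern are illustrative assumptions, not part of the original analysis); it relies on pandas' `Series.str.extract`.

```python
import pandas as pd

# Toy Position strings in the same "start-stop(strand)" layout as the notebook's data.
positions = pd.Series(["123165-123182(+)", "51934-51951(-)"], name="Position")

# Three capture groups: start digits, stop digits, and the strand sign inside the parentheses.
parts = positions.str.extract(r"^(\d+)-(\d+)\(([+-])\)$")
parts.columns = ["Start", "Stop", "Strand"]

# Rows that do not match the pattern would come back as NaN,
# so this integer conversion assumes every row matched.
parts = parts.astype({"Start": "int32", "Stop": "int32"})
print(parts)
```

Because the pattern is anchored at both ends, a malformed Position string would surface as a missing value rather than being split silently.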

In [4]:
data.head() #Position is separated into 3 columns

,Sequence,Chromosome,Start,Stop,Strand
0,GGGGTCCTTAGTGGAGG,chr9,51934,51951,-
1,GGGGGTCCTTAGTGGAG,chr9,51935,51952,-
2,TGGGGGTCCTTAGTGGA,chr9,51936,51953,-
3,AGTTTGGGGGTCCTTAG,chr9,51940,51957,-
4,AGGCATATCCTATTTAC,chr9,52844,52861,-


Next we group the data by sequence and collect the unique start positions of each sequence.

In [6]:
grp1=data.groupby(['Sequence'])['Start'].unique()
grp2=grp1[:10]
grp2

Sequence
AAAAAAAACTGTGGGGA             [39210382, 39739600, 40553742, 43765994]
AAAAAAAAGCAGCTCAC             [39229939, 39759162, 40573179, 43746558]
AAAAAAAAGGGGGAACA    [39543872, 40871738, 41140449, 41211833, 41688...
AAAAAAACCACAGGGCA             [42362338, 43139552, 67920737, 69375965]
AAAAAAAGAGGAGCTGG    [44851184, 45018920, 45754365, 46140695, 46363...
AAAAAAAGCAAGCCCTT    [44782288, 45088713, 46209777, 46294488, 67703...
AAAAAAAGCTCTAGCCT             [41884157, 44474005, 46914567, 65975198]
AAAAAAATGAGGGTGGC    [42309493, 43192309, 45382476, 68351980, 70704...
AAAAAAATGCCAGTAGC    [39423308, 39952550, 40768023, 41262916, 41568...
AAAAAACACAGGGGCAT    [38966529, 40136258, 44734615, 47194231, 65765...
Name: Start, dtype: object

In [7]:
# 10 kbp window: for each sequence, check whether the difference between its largest and
# smallest start position is less than 10,000 bp.
mask2= [(np.max(x)-np.min(x))<10000 for x in grp1]

In [8]:
np.sum(mask2) # Number of hotspot sequences whose occurrences all fall within a 10 kbp window.

3161
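
The max-minus-min test can also be written without a Python-level loop. The following is a minimal sketch on toy data (the `data` frame, the sequence names and the `window` variable are made up for illustration); it computes the per-sequence span with `groupby(...).agg` and compares it to the window size.

```python
import pandas as pd

# Toy data: "AAA" spans 900 bp, "CCC" spans 50,000 bp.
data = pd.DataFrame({
    "Sequence": ["AAA", "AAA", "AAA", "AAA", "CCC", "CCC", "CCC", "CCC"],
    "Start":    [100, 400, 700, 1000, 100, 400, 700, 50100],
})

window = 10000  # window size in bp (10 kbp, as used in this notebook)

# Per-sequence smallest and largest start position.
span = data.groupby("Sequence")["Start"].agg(["min", "max"])

# True for sequences whose occurrences all lie within the window.
within = (span["max"] - span["min"]) < window

kept = within.index[within]
print(list(kept))
```

On this toy input only "AAA" passes, since the copies of "CCC" are spread over more than 10 kbp.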

In [10]:
remain = grp1.index[mask2]
len(remain)

3161

In [11]:
#We now collect the sequences whose occurrences all lie within a 10 kbp window.
in_10kb = data[data.Sequence.isin(remain)] 
print(in_10kb.head())
print(len(in_10kb))

              Sequence Chromosome   Start    Stop Strand
307  CTCTGATCACAGAACCT       chr9  320115  320132      -
308  TCTCTGATCACAGAACC       chr9  320116  320133      -
309  ACTCACGGAAAAAGCCC       chr9  320098  320115      +
310  GGTTCTGTGATCAGAGA       chr9  320116  320133      +
311  CTGTGATCAGAGATGGC       chr9  320120  320137      +
22588
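
Note that the max-minus-min test keeps a sequence only when all of its occurrences lie inside the window. If one instead wanted to keep sequences that have at least 4 copies inside some 10 kbp window, even when further copies lie elsewhere on the chromosome, a sliding check over the sorted start positions is one possible variant. This is a hedged sketch, not the method used in this notebook; `has_cluster`, `k` and `window` are hypothetical names.

```python
import numpy as np

def has_cluster(starts, k=4, window=10000):
    """Return True if any k distinct start positions span less than `window` bp."""
    s = np.sort(np.unique(starts))
    if len(s) < k:
        return False
    # Compare each position with the one k-1 places further along the sorted array.
    return bool(np.any(s[k - 1:] - s[:len(s) - k + 1] < window))

# Four copies close together plus one distant copy:
# the overall span is large, but a cluster of 4 within 10 kbp exists.
starts = [100, 400, 700, 1000, 5000000]
print(max(starts) - min(starts) < 10000, has_cluster(starts))
```

The two criteria differ exactly in cases like this one, where the overall span exceeds the window but a tight cluster of copies exists.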


In [12]:
#Checking the last 50 entries in the dataframe.
in_10kb.groupby(['Sequence','Start','Stop']).sum().tail(50)

Sequence,Start,Stop,Chromosome,Strand
TTTGCCCACGACCAGCC,115851834,115851851,chr9,+
TTTGCGCCCGCTCCTGG,115822375,115822392,chr9,-
TTTGCGCCCGCTCCTGG,115827824,115827841,chr9,-
TTTGCGCCCGCTCCTGG,115833272,115833289,chr9,-
TTTGCGCCCGCTCCTGG,115838718,115838735,chr9,-
TTTGCGCCCGCTCCTGG,115844160,115844177,chr9,-
TTTGCGCCCGCTCCTGG,115849606,115849623,chr9,-
TTTGCTCCTTTTCCTTG,137918977,137918994,chr9,-
TTTGCTCCTTTTCCTTG,137919015,137919032,chr9,-
TTTGCTCCTTTTCCTTG,137919053,137919070,chr9,-


The preliminary analysis on chr9 works well, so below I repeat the same steps in a loop over all the chromosomes.

In [13]:
import re
chr_num = ['chr1','chr2','chr3','chr4','chr5','chr6','chr7','chr8','chr9','chr10',
           'chr11','chr12','chr13','chr14','chr15','chr16','chr17','chr18','chr19','chr20',
           'chr21','chr22','chrX','chrY']
#chr_num = ['chr1','chr2','chr3']
# Note: the last column is labelled "HS_in100kb", but the window applied below is 10 kbp.
print("Chrom : Reads : Hotspots : HS_in100kb")
for x in chr_num:
    filename = "/home/jigar/Desktop/nt_17/chrom_only/"+x+"_4more.fa"
    tmp1 = pd.read_csv(filename,header=None,sep='\t',names=['Sequence','Chromosome','Position'])
    
    
    # Use a separate loop variable (pos) so the chromosome name in x is not shadowed.
    j1 = np.asarray([re.split(r'(?<=[0-9-+])[-()]', pos)[:3] for pos in tmp1.Position.tolist()])

    tmp2 = pd.DataFrame(j1,columns=['Start','Stop','Strand'])

    tmp2[['Start','Stop']]=tmp2[['Start','Stop']].astype(np.int32)

    data=pd.concat([tmp1,tmp2],axis=1)
    del data['Position']
    
    grp1=data.groupby(['Sequence'])['Start'].unique()
    mask2= [np.max(x)-np.min(x)<10000 for x in grp1] #CHANGE THIS FOR 1000 or 10000
    remain = grp1.index[mask2]
    in_10kb = data[data.Sequence.isin(remain)]  # sequences whose occurrences all lie within the 10 kbp window

    file1 = "/home/jigar/Desktop/nt_17/chrom_only/"+ x + "_with_overlap.fa"

    in_10kb.to_csv(file1,header=None,index=None,sep='\t')
    
    print("{0} : {1} : {2} : {3}".format(x,len(tmp1),len(tmp1.Sequence.value_counts()),len(in_10kb.Sequence.value_counts())))
    


Chrom : Reads : Hotspots : HS_in100kb
chr1 : 168852 : 22919 : 5847
chr2 : 95715 : 14935 : 5034
chr3 : 17838 : 2054 : 1650
chr4 : 41252 : 4405 : 3279
chr5 : 65095 : 9281 : 3133
chr6 : 32262 : 4602 : 3290
chr7 : 108754 : 17362 : 4822
chr8 : 64918 : 7933 : 3996
chr9 : 271391 : 51593 : 3161
chr10 : 118367 : 20675 : 4682
chr11 : 33668 : 5162 : 2702
chr12 : 28501 : 3468 : 2910
chr13 : 27791 : 3377 : 3128
chr14 : 19806 : 3161 : 1238
chr15 : 129763 : 22458 : 996
chr16 : 103819 : 17726 : 3052
chr17 : 98052 : 16520 : 3614
chr18 : 19003 : 2308 : 2274
chr19 : 95626 : 13207 : 5177
chr20 : 18169 : 2570 : 2199
chr21 : 11412 : 1627 : 1538
chr22 : 49912 : 8067 : 2085
chrX : 82399 : 13193 : 2645
chrY : 154979 : 28816 : 660


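The table above gives, for each chromosome, the total number of sequences present (Reads), the number of distinct hotspot sequences overall (Hotspots), and the number of those that fall within a 10 kbp window (the last column, whose label says 100kb although the window used is 10 kbp).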

The data for each chromosome is saved for further analysis.
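
The filtering step applied to every chromosome could also be wrapped in a small reusable function. The sketch below is one possible way to do that, assuming a DataFrame with `Sequence` and `Start` columns as built above; the function name `keep_within_window` and the toy data are hypothetical.

```python
import pandas as pd

def keep_within_window(data, window=10000):
    """Keep rows whose sequence has all of its start positions within `window` bp."""
    starts = data.groupby("Sequence")["Start"]
    # transform() broadcasts the per-sequence max and min back onto every row.
    span = starts.transform("max") - starts.transform("min")
    return data[span < window]

# Toy example: "AAA" is confined to 900 bp, "CCC" is spread over 50,000 bp.
toy = pd.DataFrame({
    "Sequence": ["AAA", "CCC", "AAA", "CCC", "AAA", "AAA"],
    "Start":    [100, 100, 400, 50100, 700, 1000],
})
filtered = keep_within_window(toy)
print(filtered)
```

Taking the window size as a parameter would make it easy to rerun the analysis for 1 kbp or 10 kbp without editing the loop body.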