# Tandem Repeat analysis
#### Written by Jigar N. Bandaria
This notebook continues the analysis of the files saved in the previous notebook, 'sgRNA analysis 1'.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np

In [3]:
#Loading and checking the contents of the file
dups = pd.read_csv('/home/user/Desktop/nt_17/dup_uni/all_dup_ex_nochrp.fa',header=None,sep='\t',names=['Sequence','Chromosome','Position'])
dups.head()

Unnamed: 0,Sequence,Chromosome,Position
0,GCCTCACTCTGTCACCC,chr10,61207-61224(+)
1,CACTCTGTCACCCAGGT,chr10,61211-61228(+)
2,TTGAGCCCAGGAGCTTG,chr10,61265-61282(-)
3,TACTTAGGAGGCTGAGG,chr10,61294-61311(-)
4,AGCTACTTAGGAGGCTG,chr10,61297-61314(-)


In [4]:
len(dups) #Total number of sequences

37931217

In [5]:
#Grouping and checking the chromosome distribution of each sequence
data2 = dups.groupby(['Sequence'])['Chromosome'].unique()
data2

#There are some sequences that are present on multiple chromosomes.

Sequence
AAAAAAAAAACCGCACG                                               [chr2]
AAAAAAAAAACCGGTCC                                               [chrY]
AAAAAAAAAAGACCCGG                                               [chrX]
AAAAAAAAAAGGAGCCG                                               [chr9]
AAAAAAAAAAGGCGAGG                                              [chr19]
AAAAAAAAAAGGGAGCG                                               [chr1]
AAAAAAAAAAGGGCAGC                                               [chr4]
AAAAAAAAACAGCGGCC                                               [chr2]
AAAAAAAAACAGCTGGC                                              [chr17]
AAAAAAAAACATGGCCC                                              [chr22]
AAAAAAAAACCGCACTG                                               [chr1]
AAAAAAAAACCGCGCAA                                               [chr4]
AAAAAAAAACCGGTCCT                                               [chrY]
AAAAAAAAACTCGCCTG                                               [chr

In [6]:
print ("Dups : {}".format(len(dups)))
print ("Data2 : {}".format(len(data2)))

Dups : 37931217
Data2 : 2450798


Below we remove the duplicated sequences that occur on more than one chromosome.

In [7]:
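#One boolean per unique sequence: True if the sequence occurs on more than one chromosome
more_than_one = [len(x)!=1 for x in data2.values]
#List of the sequences that occur on multiple chromosomes
mult_chrom = data2.index[more_than_one].tolist()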
print ("Length of more_than_one : {}".format(len(more_than_one)))
print ("Length of mult_chrom : {}".format(len(mult_chrom)))
mult_chrom[:10]

Length of more_than_one : 2450798
Length of mult_chrom : 57610


['AAAAAAAAAGCCAGGCA',
 'AAAAAAAAAGCCGGGCA',
 'AAAAAAAAAGCCGGGTG',
 'AAAAAAAAAGCTGGGCA',
 'AAAAAAAAAGGCTGGGG',
 'AAAAAAAAATCCCTGCG',
 'AAAAAAAAGAGGAGGGC',
 'AAAAAAAAGGCCAGGCA',
 'AAAAAAAAGGCCGGGCA',
 'AAAAAAAAGGCTGGGCA']

In [8]:
#Rows whose sequence occurs on multiple chromosomes are replaced with NaN
dups_reduced = dups.mask(dups.Sequence.isin(mult_chrom))


In [9]:
#Dropping the masked (NaN) rows leaves only sequences found on a single chromosome
dups_reduced.dropna(how='any',inplace=True)
dups_reduced

Unnamed: 0,Sequence,Chromosome,Position
1213,TTCAGCTTCCAGCTCCC,chr10,123165-123182(+)
1216,TTCAGCTTCCAGCTCCC,chr10,123185-123202(+)
1252,CTCAGGGTGGAGGCTCA,chr10,125637-125654(-)
1253,GCTCAGGGTGGAGGCTC,chr10,125638-125655(-)
1254,CTGGGCTGAGCTCAGGG,chr10,125647-125664(-)
1255,GCTCTGGGCTGAGCTCA,chr10,125650-125667(-)
1256,AGCTCTGGGCTGAGCTC,chr10,125651-125668(-)
1257,CTCAGGGTGGAGGCTCA,chr10,125673-125690(-)
1258,GCTCAGGGTGGAGGCTC,chr10,125674-125691(-)
1259,CTGGGCTGAGCTCAGGG,chr10,125683-125700(-)
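
The `mask` followed by `dropna` used above can also be written as a single boolean filter. The following is a minimal sketch on a made-up toy frame (the `toy` data and the names `n_chrom` and `toy_reduced` are hypothetical, not part of the original analysis). It assumes the input has no missing values to begin with, in which case it should keep the same rows as the two-step approach.

```python
import pandas as pd

# Toy stand-in for `dups`; the rows are invented for illustration only.
toy = pd.DataFrame({
    "Sequence":   ["AAA", "AAA", "CCC", "CCC", "GGG"],
    "Chromosome": ["chr1", "chr1", "chr1", "chr2", "chrX"],
    "Position":   ["1-4(+)", "9-12(+)", "3-6(-)", "7-10(+)", "5-8(-)"],
})

# Number of distinct chromosomes per sequence, broadcast back to every row.
n_chrom = toy.groupby("Sequence")["Chromosome"].transform("nunique")

# Keep only the rows whose sequence occurs on exactly one chromosome.
toy_reduced = toy[n_chrom == 1]
print(toy_reduced)
```

Here `CCC` is dropped because it appears on both `chr1` and `chr2`, while the two `AAA` rows are kept because both lie on `chr1`. Filtering with a boolean mask avoids creating intermediate NaN rows, which may matter for a table with tens of millions of rows.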


Save the sequences that are present on only one chromosome. This file will be used for further analysis in the next notebook.

In [10]:
file_path = "/home/user/Desktop/nt_17/dup_uni/"
filename = file_path+"dup_with_one_chrom.fa"
dups_reduced.to_csv(filename,header=None,index=None,sep='\t')

In [11]:
#Rechecking to make sure that only sequences present on a single chromosome are retained.
data3 = dups_reduced.groupby(['Sequence'])['Chromosome'].unique().value_counts()

In [12]:
data3

[chr1]     223529
[chr2]     205628
[chr7]     205385
[chrX]     163933
[chr16]    149186
[chr9]     139911
[chrY]     134147
[chr17]    132270
[chr10]    132063
[chr15]    128045
[chr5]      91732
[chr19]     91191
[chr11]     79899
[chr8]      75362
[chr6]      74067
[chr22]     66408
[chr12]     59395
[chr3]      58371
[chr4]      56389
[chr14]     35102
[chr13]     33237
[chr20]     28389
[chr18]     18848
[chr21]     10701
Name: Chromosome, dtype: int64

As can be seen above, we now only have sequences that are present on a single chromosome. There are no sequences with entries such as [chr1, chr3]. We also have a count of how many unique sequences are present on each chromosome.
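
The same check can be made explicit with `nunique`, which avoids counting array-valued entries (a step that may not behave identically across pandas versions). This is a sketch on invented data; `toy_reduced`, `per_seq`, and `seqs_per_chrom` are hypothetical names standing in for the notebook's variables.

```python
import pandas as pd

# Hypothetical stand-in for `dups_reduced`; values are made up.
toy_reduced = pd.DataFrame({
    "Sequence":   ["AAA", "AAA", "GGG", "TTT"],
    "Chromosome": ["chr1", "chr1", "chrX", "chr1"],
})

# Every sequence should map to exactly one distinct chromosome.
per_seq = toy_reduced.groupby("Sequence")["Chromosome"].nunique()
assert (per_seq == 1).all(), "some sequences span multiple chromosomes"

# Count unique sequences per chromosome: collapse repeated
# (Sequence, Chromosome) pairs first, then tally the chromosomes.
seqs_per_chrom = (
    toy_reduced.drop_duplicates(["Sequence", "Chromosome"])["Chromosome"]
    .value_counts()
)
print(seqs_per_chrom)
```

The assertion fails loudly if any multi-chromosome sequence slipped through, and `seqs_per_chrom` gives the per-chromosome tally as a plain string-indexed Series.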