# Find all duplicates

### Strategy
1) Run a pairwise alignment between all sequences. <br>
2) If a pair meets the threshold of 0.98 (set from the BE-duplicates, the only known duplicates, which score 0.986), test it for the same organism name and the same GCX id (just the digits, without the GCF_ prefix and the .1 suffix). <br>
3) If the pair is a MIBiG / NCBI pair, only the organism names have to match. If the pair is an NCBI / NCBI pair, their GCX ids must match as well. <br>
4) Keep the largest file of each duplicate pair. <br>
5) Copy all files to a new folder, leaving the smallest file from each duplicated pair behind.
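
The decision rule in steps 2 and 3 can be sketched as below. This is a minimal sketch, not the notebook's implementation: `gcx_number`, `is_duplicate`, the dictionary keys, and the example accessions are hypothetical names chosen for illustration, and organism names are compared with plain equality here.

```python
import re

def gcx_number(accession):
    """Return only the digits of an assembly accession.

    Assumes the usual 'GCF_<digits>.<version>' / 'GCA_<digits>.<version>' layout.
    """
    if not accession:
        return None
    match = re.fullmatch(r"GC[AF]_(\d+)\.\d+", accession)
    return match.group(1) if match else None

def is_duplicate(score, rec_a, rec_b, threshold=0.98):
    """Apply the threshold, organism and GCX checks to one pair of records."""
    if score < threshold:
        return False
    if rec_a["organism"] != rec_b["organism"]:
        return False
    # A pair involving a MIBiG record only needs matching organism names
    if rec_a["source"] == "MIBiG" or rec_b["source"] == "MIBiG":
        return True
    # NCBI / NCBI pairs must also share the numeric part of the GCX id
    num_a, num_b = gcx_number(rec_a["gcx_id"]), gcx_number(rec_b["gcx_id"])
    return num_a is not None and num_a == num_b
```

With this rule a GCF and a GCA accession that share the same digits count as the same assembly, which is the assumption the strategy relies on.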

Alignment: https://stackoverflow.com/questions/55326610/show-only-dna-alignment-score-in-biopython

Make a new folder with all internal duplicates removed:
https://datatofish.com/copy-file-python/

# Code

In [1]:
import pandas as pd
import os
from Bio import SeqIO

In [2]:
all_DHs=pd.read_csv('DH_tablefile.txt', sep='\t')
all_DHs=all_DHs.drop(columns=['Unnamed: 0'])

In [3]:
BGC=all_DHs[all_DHs['name'].str.contains("BGC")].reset_index(drop=True)
RS=all_DHs[all_DHs['name'].str.contains("WP_")].reset_index(drop=True)
GB=all_DHs[~all_DHs['name'].str.contains("WP_|BGC")].reset_index(drop=True)

In [4]:
def find_filesize(group):
    size = []
    for filename in group:
        size.append(os.path.getsize(r"C:\Users\ASUS\Desktop\ny_bigscape_in\All\{}".format(filename)))
    return size

# print(find_filesize(['PZM89247.1.region001.gbk', 'RQX47637.1.region001.gbk']))

In [12]:
def get_alignment_scores(dataframe):
    from Bio import Align
    groups=[]
    # Create the aligner once instead of once per pair
    aligner = Align.PairwiseAligner()
    seqs = list(dataframe['AAseq'])
    names = list(dataframe['name'])
    for i in range(len(seqs)):
        # Start at i+1 so each unordered pair is scored exactly once
        for j in range(i+1, len(seqs)):
            score=aligner.score(seqs[i], seqs[j])
            # Normalise by the longer sequence so the value lies between 0 and 1
            value=score/max(len(seqs[i]), len(seqs[j]))
            if 0.98 <= value <= 1:
                # Each name is wrapped in a list, as find_names expects
                groups.append([[names[i]], [names[j]]])
    return groups

In [6]:
def find_names(group):
    proteinIDs=[]
    filenames=[]
    for name in group:
            name=''.join(name)
            if name[-1] == 'X':
                file_name=name.split('X')
                if len(file_name[0]) < 3:
                    name=file_name[0]+'X'+file_name[1]
                    filenames.append(name+'.gbk')
#                     print(proteinID_real)
                else:
                    name=file_name[0]
                    filenames.append(name+'.gbk')
            else:
                filenames.append(name+'.gbk')
            proteinID=name.split('.')
            if len(proteinID) > 1:
                proteinID_real = proteinID[0]+'.'+proteinID[1]
                proteinIDs.append(proteinID_real)
            if len(proteinID)==1:
                proteinIDs.append(proteinID[0])
    return proteinIDs, filenames

# print(find_names(['PZM89247.1.region001X', 'BGC0000003XX']))
# print(find_names(['BGC0000093', 'WP_063273420.1']))
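
For comparison, the name handling above can be sketched more compactly. This is an assumption-laden alternative, not the function used in the run: `split_name` is a hypothetical helper, and it assumes that trailing `X` characters are only markers appended to repeated names and are never part of a real accession.

```python
def split_name(name):
    """Split a table name into (protein ID, .gbk filename).

    Assumes trailing 'X' characters are disambiguation markers only.
    """
    base = name.rstrip('X')            # 'BGC0000003XX' -> 'BGC0000003'
    parts = base.split('.')
    # Accession plus version, e.g. 'PZM89247.1'; MIBiG ids have no dot
    protein_id = '.'.join(parts[:2])
    return protein_id, base + '.gbk'
```

Because only the end of the string is stripped, an `X` inside the accession itself is left alone.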

In [7]:
def find_BGC_data(gb_file_name):
    for rec in SeqIO.parse(gb_file_name, 'gb'):
        organism=rec.annotations["organism"]
    return organism
# test=find_BGC_data(r'C:\Users\ASUS\OneDrive - Aarhus Universitet\Bachelorprojekt\Fra skrivebordet\ny_bigscape_in\All\BGC0000001.gbk')
# print(test)

In [8]:
# Only returns the group if the shorter organism name matches the start of the longer one. These have to be checked manually.
# https://www.geeksforgeeks.org/python-find-maximum-length-sub-list-in-a-nested-list/
# This might not work with BGC -> GenBank duplicates
# Assumption: if a MIBiG and an NCBI result have the same organism name and matching dehydratases, they are the same result.
# Assumption: if a GCF and a GCA id have the same NUMBER, they are the same organism.
def test_organism_same(group):
    df=pd.read_csv(r'C:\Users\ASUS\OneDrive - Aarhus Universitet\Bachelorprojekt\Fra skrivebordet\Finished_data\extract100kb_500_hits\blastp_500_all_data_no_duplicates_sorted.txt', sep='\t')
    test=[]
    full =[]
    GCX=[]
    for proteinID in group:
        if "BGC" in proteinID:
            path=r'C:\Users\ASUS\Desktop\ny_bigscape_in\All\{}.gbk'.format(proteinID)
            organism=find_BGC_data(path)
            full.append(organism)
            GCX.append(None)
        else:
            full.append(''.join(list(df[df['proteinID']==proteinID]['Organism'])))
            GCX_id=''.join(list(df[df['proteinID']==proteinID]['GCX_id']))
            GCX_id=GCX_id[4:13]
            GCX.append(GCX_id)
        splits=[]
    for name in range(len(full)):
        splits.append(full[name].split(' '))
    for i in range(len(splits)):
        test.append([])
        for j in range(min(map(len,splits))):
            test[i].append(splits[i][j]) 
        test[i]=' '.join(test[i])   
#     print(test)
#     print(group[0])
    if "BGC" in group[0] or "BGC" in group[1]:
        if test[0] == test[1]:
            return group, full        
    else:
        if test[0] == test[1] and GCX[0] == GCX[1]:
            return group, full

# print(test_organism_same(['BGC0001330', 'WP_047890614.1']))
# print(test_organism_same(['SFO58424.1', 'SPT57320.1']))

# print(test_organism_same(['APY90477.1', 'WP_103557568.1']))
# print(test_organism_same(['BGC0001330', 'BGC0001331']))
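
The organism comparison inside `test_organism_same` truncates both names to the word count of the shorter one before comparing. A minimal standalone sketch of just that step, where `same_organism_prefix` is a hypothetical name:

```python
def same_organism_prefix(name_a, name_b):
    """Compare two organism names on the words they both have."""
    words_a = name_a.split(' ')
    words_b = name_b.split(' ')
    n = min(len(words_a), len(words_b))
    return words_a[:n] == words_b[:n]
```

This is why a name with an extra strain label still matches its shorter form, and also why a bare genus name would match every species in that genus, so the hits need a manual check.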


In [9]:
# Finds the smaller file(s) in each confirmed duplicate group; these are the ones to leave out
def find_filenames_single_g(list_of_groups):
    smallest=[]
    del_list=[]
    for group in list_of_groups:
        proteinIDs, filenames = find_names(group)
        # Call test_organism_same once per group; it re-reads the table on every call
        result = test_organism_same(proteinIDs)
        if result is not None:
            matched_group, organisms = result
            if [organisms, matched_group] not in del_list:
                del_list.append([organisms, matched_group])
            # Sort by file size (then name); everything except the largest is dropped
            sorted_pairs = sorted(zip(find_filesize(filenames), filenames))
            for _, file_name in sorted_pairs[:-1]:
                if file_name not in smallest:
                    smallest.append(file_name)
    print(del_list)
    return smallest
# print(find_filenames_single_g([['BGC0000093', 'WP_063273420.1'],['BGC0000093', 'WP_063273420.1']]))                              
# print(find_filenames_single_g([['BGC0001330', 'BGC0001331'],['PZM89247.1.region001X', 'RQX47637.1.region001XX']]))
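
The "keep the largest file" step on its own can be sketched as follows. `smaller_files` is a hypothetical helper that takes full paths rather than the bare filenames used in the notebook.

```python
import os

def smaller_files(paths):
    """Return every path except the largest file (ties broken by path order)."""
    ranked = sorted(paths, key=lambda p: (os.path.getsize(p), p))
    return ranked[:-1]
```

Sorting on the `(size, path)` pair mirrors sorting the zipped `(size, filename)` tuples, so the last element is the file that is kept.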

In [10]:
def intern_del_list(dataframe):
    list_of_groups=get_alignment_scores(dataframe)
    smallest=find_filenames_single_g(list_of_groups)
    return smallest

In [37]:
# GB_intern_dub=intern_del_list(GB)
# RS_intern_dub=intern_del_list(RS)
# BGC_intern_dub=intern_del_list(BGC[0:200])

# Find list of files to delete

In [11]:
across=intern_del_list(all_DHs)

[[['Amycolatopsis keratiniphila', 'Amycolatopsis keratiniphila'], ['AGM05849.1', 'WP_043848779.1']], [['Sandaracinus amylolyticus', 'Sandaracinus amylolyticus'], ['AKF07185.1', 'WP_083458533.1']], [['Streptomyces sp. CNQ-509', 'Streptomyces sp. CNQ-509'], ['AKH87109.1', 'WP_078898505.1']], [['Streptomyces alfalfae', 'Streptomyces alfalfae'], ['APY90477.1', 'WP_103557568.1']], [['Micromonospora sp. M42', 'Micromonospora sp. M42'], ['EWM63004.1', 'BGC0001328']], [['Streptomyces sp. OspMP-M45', 'Streptomyces sp. OspMP-M45'], ['SBU93598.1', 'WP_093675912.1']], [['Micromonospora purpureochromogenes', 'Micromonospora purpureochromogenes'], ['SCF27481.1', 'WP_088964688.1']], [["Nostoc sp. 'Peltigera membranacea cyanobiont' N6", "Nostoc sp. 'Peltigera membranacea cyanobiont'"], ['WP_104899371.1', 'BGC0001071']], [['Streptomyces sp. MspMP-M5', 'Streptomyces sp. MspMP-M5'], ['WP_106731933.1', 'BGC0001332']], [['Sorangium cellulosum', 'Sorangium cellulosum'], ['BGC0000014', 'BGC0000080']], [['Str

Manually adjust 9 entries to keep the MIBiG results instead of the NCBI ones. See notes. Adjust so it matches what you have sent to Daniela.

In [26]:
# across[4] = 'EWM63004.1.region001.gbk'
# across[5]= 'SBU93598.1.region001.gbk'
# across[7] = 'WP_104899371.1.region001.gbk'
# across[8] = 'WP_106731933.1.region001.gbk' #= 'WP_047890614.1.region001.gbk'
# across[5] = 'WP_093675912.1.region001.gbk'
# # across[28] = 'WP_019032754.1.region001.gbk'
# across[14] = 'WP_050432983.1.region001.gbk'
# across[26] = 'WP_051610873.1.region001.gbk'
# across[27] = 'WP_047890614.1.region001.gbk'

# print(across)


['AGM05849.1.region001.gbk', 'AKF07185.1.region001.gbk', 'AKH87109.1.region001.gbk', 'APY90477.1.region001.gbk', 'EWM63004.1.region001.gbk', 'WP_093675912.1.region001.gbk', 'SCF27481.1.region001.gbk', 'WP_104899371.1.region001.gbk', 'WP_106731933.1.region001.gbk', 'BGC0000080.gbk', 'BGC0000066.gbk', 'BGC0001276.gbk', 'BGC0000164.gbk', 'BGC0001199.gbk', 'BGC0001677.gbk', 'BGC0001537.gbk']


In [12]:
# print('GB intern dubs:', GB_intern_dub)
# print('RS intern dubs:', RS_intern_dub)
# print('MIBIG intern dubs', MIBIG_intern_dubs)
print('across dubs', len(across))

across dubs 16


# Copy files

In [39]:
import shutil
duplicates=across
All_no_dub=[]
o_path=r"C:\Users\ASUS\Desktop\ny_bigscape_in\All"
t_path=r'C:\Users\ASUS\Desktop\ny_bigscape_in\All_no_dub'
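# List the target folder once instead of on every iteration
already_copied=set(os.listdir(t_path))
for file in os.listdir(o_path):
    if file not in duplicates and file not in already_copied:
        # os.path.join avoids the invalid '\{' escape in a non-raw string
        original=os.path.join(o_path, file)
        target=os.path.join(t_path, file)
        shutil.copyfile(original, target)
        All_no_dub.append(file)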
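
The same copy step can be wrapped in a reusable function. This is a sketch under assumptions: `copy_without_duplicates` is a hypothetical name, and unlike the cell above it creates the target folder if it is missing.

```python
import shutil
from pathlib import Path

def copy_without_duplicates(src_dir, dst_dir, duplicates):
    """Copy every file from src_dir to dst_dir except the listed duplicates.

    Files already present in dst_dir are left untouched.
    Returns the names of the files that were copied.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    skip = set(duplicates)
    copied = []
    for path in sorted(src.iterdir()):
        if not path.is_file() or path.name in skip:
            continue
        target = dst / path.name
        if not target.exists():
            shutil.copyfile(path, target)
            copied.append(path.name)
    return copied
```

Running it a second time copies nothing, so the step is safe to repeat after the duplicate list has been adjusted by hand.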

# Some testing

In [1]:
from Bio import Align 
import pandas as pd

In [15]:
df=pd.read_csv("old_data_w_dubs_data_from_ALL_DH_blastp_AND_mibig_dub=all_DH_mibig_2_0 - test BGC dubs.txt", sep='\t')

In [16]:
result=get_alignment_scores(df)

In [17]:
print(result)

[[['BGC0001328'], ['EWM63004.1.region001']], [['BGC0001328'], ['WP_051610873.1.region001']], [['BGC0001328'], ['WP_140946099.1.region001']], [['BGC0001328X'], ['EWM63004.1.region001X']], [['BGC0001328X'], ['RQX47637.1.region001']], [['BGC0001328X'], ['WP_051610873.1.region001X']], [['BGC0001328X'], ['WP_064445511.1.region001X']], [['BGC0001328X'], ['WP_069088349.1.region001X']], [['BGC0001328X'], ['WP_124778977.1.region001X']], [['BGC0001328X'], ['WP_140946099.1.region001X']], [['BGC0001330'], ['WP_047890614.1.region001']], [['BGC0001331'], ['WP_019032754.1.region001']], [['BGC0001332'], ['WP_106731933.1.region001X']], [['BGC0001332X'], ['WP_106731933.1.region001']], [['AGM05849.1.region001'], ['WP_043848779.1.region001']], [['AGM05849.1.region001'], ['WP_072030953.1.region002']], [['AGM05849.1.region001'], ['WP_125690918.1.region001']], [['AKF07185.1.region001'], ['WP_083458533.1.region001']], [['AKF80291.1.region001'], ['WP_074958386.1.region001']], [['AKF80291.1.region001X'], ['WP_0

In [25]:

# #BGC0001330
# w='ITTADSWIVDEHRMQGHGLVPGTTYLEMVRAAVARHADGREIEFREVLFTSPVIVPDDQEREMLTTVERGDDGVLRFRVYSRGAAGRQEHCAGTVVLHDPVRRSPRTAGDLLAACDVQEVIEGEAALRHRLRLDFAADGGLIRFAVHGRWRSLSRVHVGTTGMVADLELPERYAGDLDTYLLHPALLDVVGGASRVYAAEGYYLPFWYGSLRFVRGLTSRMVCH'
# #BGC0001331
# e='ISTADSWIVDEHRMQGHGLVPGTTYLEMVRAAVAPYAHGREIEFREVLFTSPVIVPDDQEREMLTTVERGDDAVLRFRVHSRGAAGRQEHCTGTVVLHDPVRRPPRAAADLLAACGVQEVIEGEAALRRRLRLDFAAEGGLIRFAVHGRWRSLSRVHVGTSGMVADLELPERYAGDLDTYLLHPALLDVVGGASRVYAAEGYYLPFWYGSLRFVRGLTTRMVCH'
# #WP_019032754.1.region001
# r='TADSWIVDEHRMQGHGLVPGTTYLEMVRAAVAPYAHGREIEFREVLFTSPVIVPDDQEREMLTTVERGDDAVLRFRVHSRGAAGRQEHCTGTVVLHDPVRRPPRAAADLLAACGVQEVIEGEAALRRRLRLDFAAEGGLIRFAVHGRWRSLSRVHVGTSGMVADLELPERYAGDLDTYLLHPALLDVVGGASRVYAAEGYYLPFWYGSLRFVRGLTTRMVCHIRV'
# #WP_047890614.1.region001
# v='TADSWIVDEHRMQGHGLVPGTTYLEMVRAAVARHADGREIEFREVLFTSPVIVPDDQEREMLTTVERGDDGVLRFRVYSRGAAGRQEHCAGTVVLHDPVRRSPRTAGDLLAACDVQEVIEGEAALRHRLRLDFAADGGLIRFAVHGRWRSLSRVHVGTTGMVADLELPERYAGDLDTYLLHPALLDVVGGASRVYAAEGYYLPFWYGSLRFVRGLTSRMVCHIRV'

w='TEDSWIVADHRIEGHGLVPGTAYLELVRAAVAEQAAGRDIEIGDVQYMIPVVVPDGQSREVFTTIEERDGRWHFAVQSQSGAPGAAAWIDHARGTVAFPERAPETVRDLDELRAGCAVTKVLDTEESIKLGLRLDRFEKGGPIAFSFGPRWKCMREIQVGPRRVMATLRLDEAHHADLDDYLLHPALLDAAGGTARVHAPDTFYLPFSYRSLRFFHGLTSTVHAYV'
v='TDDSWIVADHRIQGHGLVPGTAYLELVRAAVAEQAAGRGIEIGDVQYMIPVVVPDGQSREVFTTIEERDGRWHFAVQSRTGAPGGVAWTDHARGTVAFFEPEPDTVRDLDELRAGCAVTEVLDTDESIKLGLRLDRFEKGGPIEFSFGPRWGCMREIQVGPKRVLATLRLDEEYHGDLDHYLLHPALLDAAGGTARVHAPDTYYLPFSYRSLRVFHGLTGTVHAYV'
# w='FEARAARTPDAVAVVGGAERLTYAELSAASDRLATRLRGLGVGAEGREDAVCLLMERSVRLPVALLAVVKAGGVYVPLDPRYPVSRMHLIMEDTGAGVLLVDGEGLDHPVTDGMHVLDVADAVAAVEPEGALELSHAGGPDRAAYIMYTSGSTGRPKGVAVTHGNVASLAADHVWRGGNHARVLMHSPTAFDASTYEMWVPLLSGGQVVVAPAGELDPEALVRTVREHGVTSAFFTAALFNLLVERDPAALAGMREVLAGGEALSPAVVAKALAAWPDTVLTNGYGPTETTTFAVLHRTREVADGTVPIGMPMDDSRAYVLDGRMRPVPVGVPGELYLAGGGLARGYVGRPGLTAQRFVACPFGAPGERMYRTGDLARRRADGRVEYLGRTDDQVKIR'
# v='FEARAARTPDAVAVVGGAERLTYAELSAASDRLATRLRGLGVGAEGREDAVCLLMERSVRLPVALLAVVKAGGVYVPLDPRYPVSRMRLIMEDTGAGVLLVDGEGLDHPVTDQMRVLDVAGELAADGVPEGAPQSAHAGGPDRAAYIMYTSGSTGRPKGVAITHGNVASLAADHVWGGGNHTRVLMHSPTAFDASTYEMWVPLLSGGQVVVAPAGELDPEALVRTVREHGVTSAFFTAALFNLLVERDPAALAGMREVLAGGEALSPAVVAKALAAWPDTVLTNGYGPTETTTFAVLHRTREVADGTVPIGMPMDDSRAYVLDGRMRPVPVGVPGELYLAGGGLARGYVGRPGLTAQRFVACPFGAPGERMYRTGDLARRRADGRVEYLGRTDDQVKIR'
# w='TADSWIVGDHRIQDHGLVPGTAYLELVRAAVAEQAAGRDVEISDVQYLVPVVVPDGQSREIFTTVEERDGRRHFAVQSRAGAPGAVTWTDHARGTVAFLDPEPDVVRDLDALLASCEVTDVLDTDESIKLGLRLDRFEKGGPIEFSFGPRWTCMKEIQVGPERVMATLRLDEEYHGDLDHYLLHPALLDAAGGTARVHAPDTYYLPFSYRSLRVLHGLTSTVHAYV'
# v='TADSWIVGDHRIQDHGLVPGTAYLELVRAAVAEQAAGRDVEISDVQYLVPVVVPDGQSREIYTTVEERDGRRHFAVQSRAGAPGAVTWTDHARGTVAFLDPEPDVVRDLDALLASCEVTDVLDTDESIKLGLRLDRFEKGGPIEFSFGPRWTCMKEIQVGPERVMATLRLDEEYHGDLDHYLLHPALLDAAGGTARVHAPDTYYLPFSYRSLRVLHGLTSTVHAYV'
# w= 'TADSWIVGDHRIQDHGLVPGTAYLELVRAAVAEQAAGRDVEISDVQYLVPVVVPDGQSREIYTTVEERDGRRHFAVQSRAGAPGAVTWTDHARGTVAFLDPEPDVVRDLDALLASCEVTDVLDTDESIKLGLRLDRFEKGGPIEFSFGPRWTCMKEIQVGPERVMATLRLDEEYHGDLDHYLLHPALLDAAGGTARVHAPDTYYLPFSYRSLRVLHGLTSTVHAYVE'
# v= 'TADSWIVGDHRIQDHGLVPGTAYLELVRAAVAEQAAGRDVEISDVQYLVPVVVPDGQSREIYTTVEERDGRRHFAVQSRAGAPGAVTWTDHARGTVAFLDPEPDVVRDLDALLASCEVTDVLDTDESIKLGLRLDRFEKGGPIEFSFGPRWTCMKEIQVGPERVMATLRLDEEYHGDLDHYLLHPALLDAAGGTARVHAPDTYYLPFSYRSLRVLHGLTSTVHAYVE'
aligner = Align.PairwiseAligner()
alignments = aligner.align(w, v)
score=aligner.score(w, v)
print(score)
print(len(w),len(v))
print(score/max(len(w),len(v)))

202.0
226 226
0.8938053097345132
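
To make the ratio above easier to interpret: assuming the aligner scores 1 per match and 0 for mismatches and gaps (this is an assumption about the settings of the run; other Biopython versions or settings give different scores), the global alignment score equals the length of the longest common subsequence. A pure-Python sketch under that assumption, with hypothetical function names:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def normalized_score(a, b):
    """Matches divided by the length of the longer sequence."""
    return lcs_length(a, b) / max(len(a), len(b))
```

Under this reading, the 0.98 threshold means that at least 98 % of the positions in the longer sequence must be matched by the other sequence.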


# OLD BUT GOOD CONSIDERATIONS
https://stackoverflow.com/questions/27975069/how-to-filter-rows-containing-a-string-pattern-from-a-pandas-dataframe <br>
I have realised that some MIBiG entries are the same, just with different compound names. <br>
Likewise, I know that some RefSeq (RS) and GB records are duplicates because some BGCs contain multiple proteins from the blastp hit. <br>
In that case, it does not matter which one gets deleted internally in those groups. I'm interested in knowing the names of the duplicates across groups.

In [4]:
BGC=sorted_dub_DHs[sorted_dub_DHs['name'].str.contains("BGC")]
RS=sorted_dub_DHs[sorted_dub_DHs['name'].str.contains("WP_")]
GB=sorted_dub_DHs[~sorted_dub_DHs['name'].str.contains("WP_|BGC")]

#### Find internal duplicates
These should be deleted from the antiSMASH and BiG-SCAPE hits. <br>
I perform 10 random samplings across the three groups.

In [5]:
BGC_del=BGC[BGC.duplicated(subset='AAseq', keep=False)]
RS_del=RS[RS.duplicated(subset='AAseq', keep=False)]
GB_del=GB[GB.duplicated(subset='AAseq', keep=False)]
print('Delete files with these DHs in BGCs',len(BGC_del))
print('Delete files with these DHs in RefSeq',len(RS_del))
print('Delete files with these DHs in GB',len(GB_del))
print('Total internal DH dublicates', len(BGC_del)+len(RS_del)+len(GB_del))

Delete files with these DHs in BGCs 293
Delete files with these DHs in RefSeq 251
Delete files with these DHs in GB 33
Total internal DH dublicates 577
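
Note that `duplicated(subset='AAseq', keep=False)` flags every row whose sequence occurs more than once, so the counts above include all copies, not only the surplus ones. A pure-Python analogue of that behaviour, where `all_copies` is a hypothetical helper:

```python
from collections import Counter

def all_copies(sequences):
    """Return the positions of every sequence that occurs more than once."""
    counts = Counter(sequences)
    return [i for i, seq in enumerate(sequences) if counts[seq] > 1]
```

For the list `["AAA", "BBB", "AAA", "CCC"]` this returns both positions of `"AAA"`, which is what `keep=False` does per row.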


#### Random sampling
In BiG-SCAPE I want only different compounds, and I think it is best to use the same dataset in all the analyses.

BGCs: Some are exactly the same, some have different compound names, some are inverted, some look like pieces of each other, and some are from different organisms. I delete the duplicates to be sure that none remain.
RS: Two KR DH domains are present in one BGC, some are from different organisms (WP_055409222), some are on the edge, some are not (WP_082722753, WP_062525444). Keep the largest record??
GB: Fragments of each other from different organisms, two KR DH domains in one BGC.


In [6]:
BGC_sample=BGC_del.sample(n=7)
RS_sample=RS_del.sample(n=7)
GB_sample=GB_del.sample(n=7)

https://stackoverflow.com/questions/28679930/how-to-drop-rows-from-pandas-data-frame-that-contains-a-particular-string-in-a-p