## Removing duplicates and rows with sequences longer than 1024

We have found some duplicate sequences in the DBP datasets, so we are going to remove them.

For `esm` pretrained models we will also remove all rows with sequences longer than 1024, because these models accept only sequences with a maximum length of 1024.


In [1]:
import pandas as pd

Duplicates are found in the DBP (DNA Binding Proteins) files only.

In [2]:
# Create file paths
fn_test = '../data/dna_binding/test.csv'
fn_train = '../data/dna_binding/train.csv'

Create dataframes

In [3]:
df_test = pd.read_csv(fn_test)
df_test.head()

| | code | sequence | label | origin |
|---|---|---|---|---|
| 0 | P27204\|1 | AKKRSRSRKRSASRKRSRSRKRSASKKSSKKHVRKALAAGMKNHLL... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |
| 1 | P53528\|1 | MVMVVNPLTAGLDDEQREAVLAPRGPVCVLAGAGTGKTRTITHRIA... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |
| 2 | P52684\|1 | MKDDINQEITFRKLSVFMMFMAKGNIARTAEAMKLSSVSVHRALHT... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |
| 3 | P10961\|1 | MNNAANTGTTNESNVSDAPRIEPLPSLNDDDIEKILQPNDIFTTDR... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |
| 4 | P06023\|1 | MAKPAKRIKSAAAAYVPQNRDAVITDIKRIGDLQREASRLETEMND... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |


In [51]:
df_train = pd.read_csv(fn_train)
df_train.head()

| | code | sequence | label | origin |
|---|---|---|---|---|
| 0 | Q6A8L0 | MSGHSKWATTKHKKAAIDAKRGKLFARLIKNIEVAARLGGGDPSGN... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |
| 1 | Q7V7T9 | MIGWLQGQKVEAWQQGTRQGVVLACAGVGYEVQIAPRHLSEMEHGQ... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |
| 2 | Q9ZUP2 | MARILRNVYSLRSSLFSSELLRRSVVGTSFQLRGFAAKAKKKSKSD... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |
| 3 | Q2JVG1 | MKCPRCGKQEIRVLESRSAEGGQSVRRRRECMSCGYRFTTYERIEF... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |
| 4 | Q9K4Q3 | MTKADIIEGVYEKVGFSKKESAEIVELVFDTLKETLERGDKIKISG... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |


Check dimensions

In [5]:
print(f'Test: {df_test.shape}')
print(f'Train: {df_train.shape}')

Test: (2272, 4)
Train: (14189, 4)


Check duplicates

In [18]:
df_train.duplicated(subset='sequence', keep=False).sum()

312

In [84]:
df_test.duplicated(subset='sequence', keep=False).sum()

0

There are no duplicates in the test file, but 312 rows in the train file share their sequence with at least one other row.

We need to decide which copy of each duplicated sequence to keep, the first or the last. The default is the first.
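The choice between first and last mainly matters when the copies of a sequence differ in some other column, such as `label`. The following is a minimal sketch on made-up data (not the DBP files) of how one might check for that; the toy dataframe and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: "CCC" appears twice with different labels.
toy = pd.DataFrame({
    "sequence": ["AAA", "AAA", "CCC", "CCC", "GGG"],
    "label": [1, 1, 1, 0, 0],
})

# Count the distinct labels per sequence; more than one means the copies disagree.
labels_per_seq = toy.groupby("sequence")["label"].nunique()
conflicting = labels_per_seq[labels_per_seq > 1].index.tolist()

print(conflicting)
```

If the list is empty, keeping the first or the last copy yields the same sequence and label pairs.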

In [85]:
df_train.duplicated(subset='sequence', keep='first').sum()

173

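We will drop 173 of the 312 rows, keeping one copy of each of the 139 duplicated sequences. Some sequences have more than 2 copies: if every one had exactly 2, the 139 sequences would account for 278 rows, not 312.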
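The relationship between the `keep=False` and `keep='first'` counts can be reproduced on a small example. This is only a sketch on a hypothetical toy dataframe, not on the DBP data; it also uses `value_counts` to list sequences that occur more than twice.

```python
import pandas as pd

# Hypothetical toy data: "AAA" appears three times, "CCC" twice, "GGG" once.
toy = pd.DataFrame({
    "code": ["a1", "a2", "a3", "c1", "c2", "g1"],
    "sequence": ["AAA", "AAA", "AAA", "CCC", "CCC", "GGG"],
})

# keep=False marks every row that belongs to a group of duplicates.
n_in_groups = int(toy.duplicated(subset="sequence", keep=False).sum())

# keep='first' marks only the extra copies, i.e. the rows that would be dropped.
n_to_drop = int(toy.duplicated(subset="sequence", keep="first").sum())

# The difference is the number of distinct sequences that have duplicates.
n_distinct_duplicated = n_in_groups - n_to_drop

# Copies per sequence, restricted to sequences that occur more than twice.
copies = toy["sequence"].value_counts()
more_than_two = copies[copies > 2]

print(n_in_groups, n_to_drop, n_distinct_duplicated)
print(more_than_two)
```

Running the same `value_counts` filter on `df_train` would show which sequences have more than 2 copies.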

In [89]:
df_train = df_train.drop_duplicates(subset='sequence')
df_train.shape

(14016, 4)

Save these two dataframes as .csv files for `prose` pretrained models.

In [90]:
df_test.to_csv('../data/dna_binding/test_prose.csv', index=False)
df_train.to_csv('../data/dna_binding/train_prose.csv', index=False)

`esm` pretrained models accept only sequences with a maximum length of 1024.

We will remove all rows with sequences longer than 1024.

**Train dataset**

In [122]:
(df_train.sequence.map(len) > 1024).sum()

904

There are 904 rows with sequences longer than 1024 in the train dataset.

Let's remove them.

In [124]:
mask = (df_train.sequence.map(len) <= 1024)
df_train = df_train[mask]
df_train.shape

(13112, 4)

**Test dataset**

In [125]:
(df_test.sequence.map(len) > 1024).sum()

190

There are 190 rows with sequences longer than 1024 in the test dataset.

Let's remove them.

In [126]:
mask = (df_test.sequence.map(len) <= 1024)
df_test = df_test[mask]
df_test.shape

(2082, 4)
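The train and test filtering steps apply the same logic, so they could be wrapped in a small helper. The function name `filter_by_max_length` and its parameters are hypothetical, and the sketch uses `.str.len()` in place of `.map(len)`; it is shown on toy data rather than on the DBP files.

```python
import pandas as pd

def filter_by_max_length(df, max_len=1024, column="sequence"):
    """Return a copy of df with only the rows whose sequence has at most max_len characters."""
    mask = df[column].str.len() <= max_len
    return df[mask].copy()

# Hypothetical toy data: one sequence exactly at the limit, one just over, one short.
toy = pd.DataFrame({
    "sequence": ["A" * 1024, "A" * 1025, "ACD"],
    "label": [1, 0, 1],
})

short = filter_by_max_length(toy)
print(short.shape)
```

With such a helper, each dataset would need a single call, for example `df_test = filter_by_max_length(df_test)`.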

And now, save these two dataframes as .csv files for `esm` pretrained models.

In [127]:
df_test.to_csv('../data/dna_binding/test_esm.csv', index=False)
df_train.to_csv('../data/dna_binding/train_esm.csv', index=False)
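Before relying on the saved files, a quick sanity check can confirm that a dataframe has no duplicate sequences and nothing above the length limit. The helper `check_clean` below is a hypothetical sketch, demonstrated on toy data with a small limit.

```python
import pandas as pd

def check_clean(df, max_len=None, column="sequence"):
    """Raise AssertionError if df still has duplicate or over-long sequences."""
    assert not df.duplicated(subset=column).any(), "duplicate sequences remain"
    if max_len is not None:
        assert (df[column].str.len() <= max_len).all(), "over-long sequences remain"
    return True

# Hypothetical toy data: one clean frame and one with a repeated sequence.
toy_ok = pd.DataFrame({"sequence": ["ACD", "ACDE"]})
toy_bad = pd.DataFrame({"sequence": ["ACD", "ACD"]})

print(check_clean(toy_ok, max_len=4))
```

For the data in this notebook the call would look like `check_clean(df_train, max_len=1024)`.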

This concludes the cleanup of the DBP .csv datasets.