## Removing duplicates and rows with sequences > 1022

#### <u>Duplicates</u>
We have found some duplicate sequences in the DBP train dataset so we are going to remove them.

#### <u>Sequence Length</u>
For `esm` pretrained models we initially planned to remove all rows with sequences longer than 1024 because, according to the [ESM repo](https://github.com/facebookresearch/esm/issues/21), these models accept only sequences with a maximum length of 1024.  

Later we found that the models also reject sequences of length 1023 and 1024.  
Passing a sequence of length 1024 to an ESM model raises the following error:
```
ValueError: Sequence length 1026 above maximum  sequence length of 1024
```
Passing a sequence of length 1023 raises the same kind of error:
```
ValueError: Sequence length 1025 above maximum  sequence length of 1024
```

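In both cases the reported length is the sequence length plus 2, which suggests that two extra tokens are added to each sequence before the limit is checked. The error does not occur for sequences of 1022 amino acids or fewer, so we removed all sequences longer than 1022 for `esm`, but not for ProSE.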
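
The arithmetic behind the two errors can be sketched as follows. This is a minimal illustration, assuming that two special tokens (a start and an end token) are added to every sequence, which matches the lengths reported in the error messages. `MAX_POSITIONS`, `N_SPECIAL_TOKENS`, and `fits_esm` are hypothetical names and not part of the `esm` package.

```python
# Hypothetical helper, not part of the esm package.
MAX_POSITIONS = 1024    # limit quoted in the error message
N_SPECIAL_TOKENS = 2    # assumption: one start and one end token are added


def fits_esm(sequence: str) -> bool:
    """Return True if the sequence plus the special tokens fits the limit."""
    return len(sequence) + N_SPECIAL_TOKENS <= MAX_POSITIONS


# Check the three boundary lengths discussed above
for n in (1022, 1023, 1024):
    print(n, fits_esm("A" * n))
```

Under this assumption only the 1022-residue sequence passes, which agrees with the behaviour we observed.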

#### <u>Tokens</u>
While computing `esm` embeddings for the test dataset we ran into another error:
```
KeyError: 'v'
```
`esm` accepts the following tokens:
```
proteinseq_toks = {
    'toks': ['L', 'A', 'G', 'V', 'S', 'E', 'R', 'T', 'I', 'D', 'P', 'K', 'Q', 'N', 'F', 'Y', 'M', 'H', 'W', 'C', 'X', 'B', 'U', 'Z', 'O', '.', '-']
}
```

The list contains only uppercase letters, so a lowercase `v` is rejected. In the last step of this notebook we will find all lowercase tokens and convert them to uppercase.
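
A quick way to see which characters in a sequence would trigger such a `KeyError` is to compare the sequence against the token list. The sketch below copies the tokens shown above into a set. `ESM_TOKENS` and `unsupported_tokens` are hypothetical names introduced only for illustration.

```python
# Token list copied from the esm vocabulary shown above.
ESM_TOKENS = set("LAGVSERTIDPKQNFYMHWCXBUZO.-")


def unsupported_tokens(sequence: str) -> set:
    """Return the characters in `sequence` that are not in the token list."""
    return set(sequence) - ESM_TOKENS


print(unsupported_tokens("MKvLA"))   # the lowercase 'v' is reported
print(unsupported_tokens("MKVLA"))   # empty set: every character is accepted
```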

In [1]:
# Import dependencies
import pandas as pd

#### Duplicates
Duplicates are found in DBP (DNA Binding Proteins) files only.

In [2]:
# Create file paths
fn_test = '../../data/dna_binding/test.csv'
fn_train = '../../data/dna_binding/train.csv'

Create dataframes

In [3]:
df_test = pd.read_csv(fn_test)
df_test.head()

Unnamed: 0,code,sequence,label,origin
0,P27204|1,AKKRSRSRKRSASRKRSRSRKRSASKKSSKKHVRKALAAGMKNHLL...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
1,P53528|1,MVMVVNPLTAGLDDEQREAVLAPRGPVCVLAGAGTGKTRTITHRIA...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
2,P52684|1,MKDDINQEITFRKLSVFMMFMAKGNIARTAEAMKLSSVSVHRALHT...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
3,P10961|1,MNNAANTGTTNESNVSDAPRIEPLPSLNDDDIEKILQPNDIFTTDR...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
4,P06023|1,MAKPAKRIKSAAAAYVPQNRDAVITDIKRIGDLQREASRLETEMND...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...


In [4]:
df_train = pd.read_csv(fn_train)
df_train.head()

Unnamed: 0,code,sequence,label,origin
0,Q6A8L0,MSGHSKWATTKHKKAAIDAKRGKLFARLIKNIEVAARLGGGDPSGN...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
1,Q7V7T9,MIGWLQGQKVEAWQQGTRQGVVLACAGVGYEVQIAPRHLSEMEHGQ...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
2,Q9ZUP2,MARILRNVYSLRSSLFSSELLRRSVVGTSFQLRGFAAKAKKKSKSD...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
3,Q2JVG1,MKCPRCGKQEIRVLESRSAEGGQSVRRRRECMSCGYRFTTYERIEF...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
4,Q9K4Q3,MTKADIIEGVYEKVGFSKKESAEIVELVFDTLKETLERGDKIKISG...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...


Check dimensions

In [5]:
print(f'Test: {df_test.shape}')
print(f'Train: {df_train.shape}')

Test: (2272, 4)
Train: (14189, 4)


Check for duplicates

In [6]:
df_train.duplicated(subset='sequence', keep=False).sum()

312

In [7]:
df_test.duplicated(subset='sequence', keep=False).sum()

0

There are no duplicates in the test file, but 312 rows in the train file share their sequence with at least one other row.

We need to decide which duplicate to keep, the first or last occurrence of the sequence. The default is the first occurrence.

In [8]:
df_train.duplicated(subset='sequence', keep='first').sum()

173

We will drop 173 of these 312 rows. The 312 rows cover only 139 distinct sequences (312 - 173), so some sequences have more than two copies.
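
The difference between `keep=False` and `keep='first'` is easier to see on a small example. The sketch below uses a made-up `toy` dataframe, not the DBP data.

```python
import pandas as pd

# Made-up data: 'AAA' appears three times, 'CCC' twice, 'GGG' once
toy = pd.DataFrame({"sequence": ["AAA", "CCC", "AAA", "AAA", "GGG", "CCC"]})

# keep=False flags every row that belongs to a duplicated group
n_all = toy.duplicated(subset="sequence", keep=False).sum()
# keep='first' flags only the rows after the first occurrence
n_extra = toy.duplicated(subset="sequence", keep="first").sum()
# The difference is the number of distinct sequences that are duplicated
n_groups = n_all - n_extra

print(n_all, n_extra, n_groups)
```

Here `n_extra` is the number of rows that `drop_duplicates` removes with its default `keep='first'`.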

In [9]:
df_train = df_train.drop_duplicates(subset='sequence')
df_train.shape

(14016, 4)

Save these two dataframes as .csv files for `prose` pretrained models, which do not need the length filter applied below.

In [10]:
df_test.to_csv('../../data/dna_binding/test_prose.csv', index=False)
df_train.to_csv('../../data/dna_binding/train_prose.csv', index=False)

#### Sequence Length

`esm` pretrained models only accept sequences with a maximum length of 1022.  

We will remove all sequences with a length above 1022 amino acids.

**Train dataset**

In [11]:
(df_train.sequence.map(len) > 1022).sum()

908

There are 908 sequences longer than 1022 in the train dataset.

Let's remove them.

In [12]:
mask = (df_train.sequence.map(len) <= 1022)
df_train = df_train[mask]
df_train.shape

(13108, 4)

**Test dataset**

In [13]:
(df_test.sequence.map(len) > 1022).sum()

191

There are 191 sequences longer than 1022 in the test dataset.

Let's remove them.

In [14]:
mask = (df_test.sequence.map(len) <= 1022)
df_test = df_test[mask]
df_test.shape

(2081, 4)
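
The same filter is applied to both dataframes, so it could also be wrapped in a small helper. The sketch below is one possible version, not the code used in this notebook. `drop_long_sequences` and `MAX_LEN` are hypothetical names, and the `toy` dataframe is made up. It uses `.str.len()` instead of `.map(len)` and returns a copy, so that later assignments to the result do not act on a view of the original dataframe.

```python
import pandas as pd

MAX_LEN = 1022  # longest sequence accepted by esm, per the errors above


def drop_long_sequences(df: pd.DataFrame, max_len: int = MAX_LEN) -> pd.DataFrame:
    """Return a copy of `df` without rows whose sequence exceeds `max_len`."""
    keep = df["sequence"].str.len() <= max_len
    return df.loc[keep].copy()


# Made-up data: only the 1023-residue sequence should be dropped
toy = pd.DataFrame({"code": ["a", "b", "c"],
                    "sequence": ["M" * 1022, "M" * 1023, "MKV"]})
print(drop_long_sequences(toy).code.tolist())
```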

#### Lowercase Tokens

In [15]:
df_test.head(2)

Unnamed: 0,code,sequence,label,origin
0,P27204|1,AKKRSRSRKRSASRKRSRSRKRSASKKSSKKHVRKALAAGMKNHLL...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
1,P53528|1,MVMVVNPLTAGLDDEQREAVLAPRGPVCVLAGAGTGKTRTITHRIA...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...


We will use this helper function to find lowercase tokens.

In [16]:
def lowercase_token(df):
    """ Find and print info about lowercase tokens
    Args:
        df: dataframe to check for lowercase tokens      
    Returns:
        Print info about lowercase tokens
    """   
    # Initialize Boolean value for printing
    prt = False
    # Loop through every row
    for idx in range(len(df)):
        row = df.iloc[idx]
        # Iterate through every character in the sequence
        for i, c in enumerate(row.sequence):
            # Print info about lowercase token
            if c.islower():
                if not prt:
                    print('', 'Lowercase tokens:', '\n', '='*17)
                print(f'Sequence ID {row.code} (index={idx}) has lowercase \'{c}\' at position {i}')
                prt = True
    if not prt:
        print('No lowercase tokens!')

In [17]:
lowercase_token(df_test)

 Lowercase tokens: 
Sequence ID Q9LXX6|2 (index=2007) has lowercase 'v' at position 623


In [18]:
lowercase_token(df_train)

No lowercase tokens!


As we can see, there are no lowercase tokens in the train dataset, and there is only one in the test dataset.

Let's convert it to uppercase.

In [19]:
mask = df_test.code == 'Q9LXX6|2'  # select the row by ID rather than by hard-coded position
df_test.loc[mask, 'sequence'] = df_test.loc[mask, 'sequence'].str.upper()

Check conversion

In [20]:
lowercase_token(df_test)

No lowercase tokens!
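
For larger datasets, the row-by-row helper could be replaced by vectorized pandas string methods. The sketch below is an alternative under that assumption, shown on a made-up `toy` dataframe. Unlike `str.islower()`, the pattern `[a-z]` matches only ASCII lowercase letters.

```python
import pandas as pd

# Made-up data: the second sequence contains a lowercase 'v'
toy = pd.DataFrame({"code": ["x1", "x2"], "sequence": ["MKVLA", "MKvLA"]})

# Flag rows whose sequence contains at least one lowercase ASCII letter
has_lower = toy["sequence"].str.contains(r"[a-z]", regex=True)
print(toy.loc[has_lower, "code"].tolist())

# Uppercase every sequence; sequences that are already uppercase are unchanged
toy["sequence"] = toy["sequence"].str.upper()
print(toy["sequence"].str.contains(r"[a-z]", regex=True).any())
```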


And now, save these two dataframes as .csv files for `esm` pretrained models.

In [21]:
df_test.to_csv('../../data/dna_binding/test_esm.csv', index=False)
df_train.to_csv('../../data/dna_binding/train_esm.csv', index=False)

This concludes cleanup of DBP .csv datasets.