# Conversion of csv files to fasta files

We received all files in csv format. Both repositories with pretrained transformer models ([Facebook's Evolutionary Scale Modeling](https://github.com/facebookresearch/esm), [Protein Sequence Embeddings (ProSE)](https://github.com/tbepler/prose)) take fasta-formatted files as input, so in this notebook we convert all our csv files to fasta files.

We have created the script *convert_csv_to_fasta.py* to help us with that.

We will follow the fasta header formatting used by the Evolutionary Scale Modeling (ESM) team from Facebook AI Research and adjust it for our data:  

`>{index}|{sequence_id}|{label}`  

where:
- `index` - index of the sequence in the fasta file  
- `sequence_id` - PDB code for the sequence  
- `label` - class label (0, 1) for the protein  
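As an illustration of the header layout above, here is a minimal sketch of how one record could be assembled. The helper names `format_header` and `to_fasta_record` are hypothetical and are not taken from *convert_csv_to_fasta.py*. Starting the index at 0 is also an assumption.

```python
def format_header(index, sequence_id, label):
    """Build a header line of the form >{index}|{sequence_id}|{label}."""
    return f">{index}|{sequence_id}|{label}"


def to_fasta_record(index, sequence_id, label, sequence):
    """Return one two-line fasta record: header line, then the sequence."""
    return f"{format_header(index, sequence_id, label)}\n{sequence}\n"


# Example values taken from the first row of the AMP table in this notebook;
# the index value 0 is an assumption about how the real script counts.
record = to_fasta_record(0, "AP02484", 1, "GMASKAGSVLGKITKIALGAL")
print(record)
```

Keeping the three fields separated by `|` means the label can later be recovered from the header by splitting on the last `|`, which matters because some sequence ids in the DNA-binding data already contain a `|`.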



In [1]:
# Import dependencies
import pandas as pd

In [2]:
# Import the script from different folder
import sys  
sys.path.append('../../scripts')

import convert_csv_to_fasta as cf

## ACP - Anti Cancer Peptides

Therapeutic peptide drugs which target and kill cancer cells.


The ACP datasets do not list PDB codes, so the conversion script inserts sequential protein "names", which the fasta format requires.

The format of the added names is <code>Protein_seq_xxxx</code> where <code>xxxx</code> is a sequential row number padded with zeros.
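The naming scheme can be sketched as follows. The function name `placeholder_name` is hypothetical, and the four-digit width is an assumption read off the `xxxx` pattern. The actual script may pad differently or start counting from another number.

```python
def placeholder_name(row_number, width=4):
    """Return a name like Protein_seq_0007 for row 7 (zero-padded to `width` digits)."""
    return f"Protein_seq_{row_number:0{width}d}"


# Names for the first three rows, assuming rows are numbered from 0
names = [placeholder_name(i) for i in range(3)]
print(names)
```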

### Train Dataset

In [3]:
# Define paths
csv_file = '../../data/acp/train_data.csv'
fasta_file = '../../data/acp/train_data.fa'

In [4]:
# Create dataframe to see data format
df = pd.read_csv(csv_file)
print(df.shape)
df.head()

(1378, 2)


Unnamed: 0,sequences,label
0,RRWWRRWRRW,0
1,GWKSVFRKAKKVGKTVGGLALDHYLG,0
2,ALWKTMLKKLGTMALHAGKAALGAAADTISQGTQ,1
3,GLFDVIKKVAAVIGGL,1
4,VAKLLAKLAKKVL,1


Run conversion

In [5]:
# Run the script
cf.csv_to_fasta(csv_file, fasta_file)

### Test Dataset

In [6]:
# Define paths
csv_file = '../../data/acp/test_data.csv'
fasta_file = '../../data/acp/test_data.fa'

In [7]:
# Create dataframe to see data format
df = pd.read_csv(csv_file)
print(df.shape)
df.head()

(344, 2)


Unnamed: 0,sequences,label
0,FLPLLLSALPSFLCLVFKKC,0
1,DKLIGSCVWLAVNYTSNCNAECKRRGYKGGHCGSFLNVNCWCET,0
2,AVKDTYSCFIMRGKCRHECHDFEKPIGFCTKLNANCYM,0
3,GLPTCGETCFGGTCNTPGCTCDPWPVCTHN,1
4,ENCGRQAG,0


Run conversion

In [8]:
# Run the script
cf.csv_to_fasta(csv_file, fasta_file)

Check if both files were created.

In [9]:
!tree ../../data/acp -nDhL 1

../../data/acp
├── [4.0K Sep 19 09:51]  esm
├── [4.0K Sep 13 10:39]  prose
├── [9.0K Dec  1  2021]  test_data.csv
├── [ 17K Oct  2 19:13]  test_data.fa
├── [ 36K Dec  1  2021]  train_data.csv
└── [ 66K Oct  2 19:13]  train_data.fa

2 directories, 4 files
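Listing the directory only shows that the files exist. A slightly stronger check is to count the records in each fasta file and compare that with the number of rows in the csv. The sketch below assumes one header line (starting with `>`) per record. The helper `count_fasta_records` is hypothetical, and the demo uses a throwaway file so that it runs on its own.

```python
import tempfile
from pathlib import Path


def count_fasta_records(fasta_path):
    """Count records by counting header lines, which start with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Self-contained demo on a temporary two-record file
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fa"
    demo.write_text(
        ">0|Protein_seq_0000|0\nRRWWRRWRRW\n"
        ">1|Protein_seq_0001|0\nGWKSVFRKAKKVGKTVGGLALDHYLG\n"
    )
    n_records = count_fasta_records(demo)

print(n_records)
```

In this notebook the same check would be `count_fasta_records(fasta_file) == len(df)`, assuming the script writes exactly one record per csv row.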


## AMP - Antimicrobial Peptides

Peptides with a wide range of inhibitory effects against bacteria, fungi, parasites, and viruses.



We have only one file for this type of peptide.

In [10]:
# Define paths
csv_file = '../../data/amp/all_data.csv'
fasta_file = '../../data/amp/all_data.fa'

In [11]:
# Create dataframe to see data format
df = pd.read_csv(csv_file)
print(df.shape)
df.head()

(4042, 3)


Unnamed: 0,PDBs_code,SequenceID,label
0,AP02484,GMASKAGSVLGKITKIALGAL,1
1,AP02630,NIGLFTSTCFSSQCFSSKCFTDTCFSSNCFTGRHQCGYTHGSC,1
2,AP01427,GAIKDALKGAAKTVAVELLKKAQCKLEKTC,1
3,AP02983,FFGRLKAVFRGARQGWKEHRY,1
4,AP01815,DFGCARGMIFVCMRRCARMYPGSTGYCQGFRCMCDTMIPIRRPPFIMG,1


Run conversion

In [12]:
# Run the script
cf.csv_to_fasta(csv_file, fasta_file)

Check if the file was created.

In [13]:
!tree ../../data/amp -nDhL 1

../../data/amp
├── [193K Feb  3  2022]  all_data.csv
├── [216K Oct  2 19:13]  all_data.fa
├── [4.0K Sep 20 15:58]  esm
└── [4.0K Sep 20 15:53]  prose

2 directories, 2 files


## DNA-Binding proteins

Proteins with an important role in DNA replication, DNA methylation, gene expression, and other biological processes.


The original files contained some duplicates, which we removed so that the train and test sets are ready as input for the `prose` models.  
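The exact deduplication rule lives in the separate notebook and is not shown here. As a rough sketch only, assuming that "duplicate" means an identical sequence string and that the first occurrence is kept, the step could look like this. The function name and the toy rows are made up for illustration.

```python
def drop_duplicate_sequences(rows):
    """Keep the first row for each distinct sequence, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row["sequence"] not in seen:
            seen.add(row["sequence"])
            unique.append(row)
    return unique


# Toy rows (not real data): the third row repeats the first sequence
rows = [
    {"code": "A1", "sequence": "MKTAY", "label": 1},
    {"code": "B2", "sequence": "GAVLI", "label": 0},
    {"code": "C3", "sequence": "MKTAY", "label": 1},
]
unique_rows = drop_duplicate_sequences(rows)
print([row["code"] for row in unique_rows])
```

With a pandas dataframe, `df.drop_duplicates(subset="sequence")` expresses the same idea in one call.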

`esm` models accept inputs of at most 1024 tokens, which leaves room for 1022 amino acids once the start and end tokens are added. We therefore removed the longer sequences and created a separate set of datasets for `esm`.  
For details, refer to the notebook *DBP - Duplicates and seq >1022.ipynb* in this folder.
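The length filter can be sketched as below. The cutoff of 1022 residues is taken from the title of the referenced notebook and is treated as an assumption here. `split_by_length` is a hypothetical helper, and the toy sequences are synthetic.

```python
MAX_RESIDUES = 1022  # assumption: cutoff implied by the notebook title "seq >1022"


def split_by_length(sequences, max_len=MAX_RESIDUES):
    """Split sequences into those that fit within max_len and those that do not."""
    kept = [seq for seq in sequences if len(seq) <= max_len]
    dropped = [seq for seq in sequences if len(seq) > max_len]
    return kept, dropped


# Synthetic sequences of length 10, 1022 and 1023
toy = ["M" * 10, "M" * 1022, "M" * 1023]
kept, dropped = split_by_length(toy)
print(len(kept), len(dropped))
```

Returning the dropped sequences as well makes it easy to report how many entries were excluded from the `esm` datasets.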

### Train Dataset - `prose`

In [14]:
# Define paths
csv_file = '../../data/dna_binding/train_prose.csv'
fasta_file = '../../data/dna_binding/train_prose.fa'

In [15]:
# Create dataframe to see data format
df = pd.read_csv(csv_file)
print(df.shape)
df.head()

(14016, 4)


Unnamed: 0,code,sequence,label,origin
0,Q6A8L0,MSGHSKWATTKHKKAAIDAKRGKLFARLIKNIEVAARLGGGDPSGN...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
1,Q7V7T9,MIGWLQGQKVEAWQQGTRQGVVLACAGVGYEVQIAPRHLSEMEHGQ...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
2,Q9ZUP2,MARILRNVYSLRSSLFSSELLRRSVVGTSFQLRGFAAKAKKKSKSD...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
3,Q2JVG1,MKCPRCGKQEIRVLESRSAEGGQSVRRRRECMSCGYRFTTYERIEF...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
4,Q9K4Q3,MTKADIIEGVYEKVGFSKKESAEIVELVFDTLKETLERGDKIKISG...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...


Run conversion

In [16]:
# Run the script
cf.csv_to_fasta(csv_file, fasta_file)

### Test Dataset - `prose`

In [17]:
# Define paths
csv_file = '../../data/dna_binding/test_prose.csv'
fasta_file = '../../data/dna_binding/test_prose.fa'

In [18]:
# Create dataframe to see data format
df = pd.read_csv(csv_file)
print(df.shape)
df.head()

(2272, 4)


Unnamed: 0,code,sequence,label,origin
0,P27204|1,AKKRSRSRKRSASRKRSRSRKRSASKKSSKKHVRKALAAGMKNHLL...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
1,P53528|1,MVMVVNPLTAGLDDEQREAVLAPRGPVCVLAGAGTGKTRTITHRIA...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
2,P52684|1,MKDDINQEITFRKLSVFMMFMAKGNIARTAEAMKLSSVSVHRALHT...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
3,P10961|1,MNNAANTGTTNESNVSDAPRIEPLPSLNDDDIEKILQPNDIFTTDR...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
4,P06023|1,MAKPAKRIKSAAAAYVPQNRDAVITDIKRIGDLQREASRLETEMND...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...


Run conversion

In [19]:
# Run the script
cf.csv_to_fasta(csv_file, fasta_file)

### Train Dataset - `esm`

In [20]:
# Define paths
csv_file = '../../data/dna_binding/train_esm.csv'
fasta_file = '../../data/dna_binding/train_esm.fa'

In [21]:
# Create dataframe to see data format
df = pd.read_csv(csv_file)
print(df.shape)
df.head()

(13108, 4)


Unnamed: 0,code,sequence,label,origin
0,Q6A8L0,MSGHSKWATTKHKKAAIDAKRGKLFARLIKNIEVAARLGGGDPSGN...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
1,Q7V7T9,MIGWLQGQKVEAWQQGTRQGVVLACAGVGYEVQIAPRHLSEMEHGQ...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
2,Q9ZUP2,MARILRNVYSLRSSLFSSELLRRSVVGTSFQLRGFAAKAKKKSKSD...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
3,Q2JVG1,MKCPRCGKQEIRVLESRSAEGGQSVRRRRECMSCGYRFTTYERIEF...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
4,Q9K4Q3,MTKADIIEGVYEKVGFSKKESAEIVELVFDTLKETLERGDKIKISG...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...


Run conversion

In [22]:
# Run the script
cf.csv_to_fasta(csv_file, fasta_file)

### Test Dataset - `esm`

In [23]:
# Define paths
csv_file = '../../data/dna_binding/test_esm.csv'
fasta_file = '../../data/dna_binding/test_esm.fa'

In [24]:
# Create dataframe to see data format
df = pd.read_csv(csv_file)
print(df.shape)
df.head()

(2081, 4)


Unnamed: 0,code,sequence,label,origin
0,P27204|1,AKKRSRSRKRSASRKRSRSRKRSASKKSSKKHVRKALAAGMKNHLL...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
1,P53528|1,MVMVVNPLTAGLDDEQREAVLAPRGPVCVLAGAGTGKTRTITHRIA...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
2,P52684|1,MKDDINQEITFRKLSVFMMFMAKGNIARTAEAMKLSSVSVHRALHT...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
3,P10961|1,MNNAANTGTTNESNVSDAPRIEPLPSLNDDDIEKILQPNDIFTTDR...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...
4,P06023|1,MAKPAKRIKSAAAAYVPQNRDAVITDIKRIGDLQREASRLETEMND...,1,https://github.com/hfuulgb/PDB-Fusion/tree/mai...


Run conversion

In [25]:
# Run the script
cf.csv_to_fasta(csv_file, fasta_file)

Check if all four files were created.

In [26]:
!tree ../../data/dna_binding -nDhL 1 --dirsfirst

../../data/dna_binding
├── [4.0K Sep 20 15:53]  esm
├── [4.0K Sep 20 16:00]  prose
├── [1.1M Dec  1  2021]  test.csv
├── [864K Oct  2 19:07]  test_esm.csv
├── [781K Oct  2 19:13]  test_esm.fa
├── [1.1M Oct  2 19:07]  test_prose.csv
├── [1.0M Oct  2 19:13]  test_prose.fa
├── [6.6M Dec  1  2021]  train.csv
├── [5.2M Oct  2 19:07]  train_esm.csv
├── [4.7M Oct  2 19:13]  train_esm.fa
├── [6.5M Oct  2 19:07]  train_prose.csv
└── [6.0M Oct  2 19:13]  train_prose.fa

2 directories, 10 files
