# Conversion of CSV files to FASTA files

We received all data files in CSV format. Both repositories with pretrained transformer models ([Facebook's Evolutionary Scale Modeling](https://github.com/facebookresearch/esm) and [Protein Sequence Embeddings (ProSE)](V)) take FASTA-formatted files as input, so in this notebook we convert all our CSV files to FASTA files.

We created the script *convert_csv_to_fasta.py* to help with that.

In [1]:
# Import dependencies
import pandas as pd
#import numpy as np

In [2]:
# Make the scripts folder importable so the conversion script can be imported
import sys
sys.path.append('../scripts')

import convert_csv_to_fasta as cf

## ACP - Anticancer Peptides

Anticancer peptides are a new type of anticancer therapeutic agent that is regarded as a safe drug.


The ACP datasets do not list PDB codes, so the conversion script inserts sequential placeholder protein names, which the FASTA format requires.

The added names have the format <code>Protein_seq_xxxx</code>, where <code>xxxx</code> is the sequential row number padded with zeros.
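The naming scheme described above can be sketched as follows. This is a minimal illustration, not the code of *convert_csv_to_fasta.py*: the helper names `make_placeholder_names` and `to_fasta` are hypothetical, and both the four-digit width and the start at row 0 are assumptions read off the <code>xxxx</code> pattern.

```python
def make_placeholder_names(n_rows, prefix="Protein_seq_", width=4):
    """Return one zero-padded placeholder name per row (assumed to start at 0)."""
    return [f"{prefix}{i:0{width}d}" for i in range(n_rows)]


def to_fasta(sequences, names):
    """Build FASTA text: a '>' header line followed by the sequence line."""
    return "".join(f">{name}\n{seq}\n" for name, seq in zip(names, sequences))


# Two sequences taken from the ACP train data preview
sequences = ["RRWWRRWRRW", "GWKSVFRKAKKVGKTVGGLALDHYLG"]
names = make_placeholder_names(len(sequences))
print(to_fasta(sequences, names), end="")
```

Each record consists of a header line starting with `>` and the sequence on the following line.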

### Train Dataset

In [3]:
# Define paths
csv_file = '../data/acp/train_data.csv'
fasta_file = '../data/acp/train_data.fa'

In [4]:
# Create dataframe to see data format
df = pd.read_csv(csv_file)
print(df.shape)
df.head()

(1378, 2)


| | sequences | label |
|---|---|---|
| 0 | RRWWRRWRRW | 0 |
| 1 | GWKSVFRKAKKVGKTVGGLALDHYLG | 0 |
| 2 | ALWKTMLKKLGTMALHAGKAALGAAADTISQGTQ | 1 |
| 3 | GLFDVIKKVAAVIGGL | 1 |
| 4 | VAKLLAKLAKKVL | 1 |


Run conversion

In [5]:
# Run the script
cf.csv_to_fasta(csv_file, fasta_file)

### Test Dataset

In [6]:
# Define paths
csv_file = '../data/acp/test_data.csv'
fasta_file = '../data/acp/test_data.fa'

In [7]:
# Create dataframe to see data format
df = pd.read_csv(csv_file)
print(df.shape)
df.head()

(344, 2)


| | sequences | label |
|---|---|---|
| 0 | FLPLLLSALPSFLCLVFKKC | 0 |
| 1 | DKLIGSCVWLAVNYTSNCNAECKRRGYKGGHCGSFLNVNCWCET | 0 |
| 2 | AVKDTYSCFIMRGKCRHECHDFEKPIGFCTKLNANCYM | 0 |
| 3 | GLPTCGETCFGGTCNTPGCTCDPWPVCTHN | 1 |
| 4 | ENCGRQAG | 0 |


Run conversion

In [8]:
# Run the script
cf.csv_to_fasta(csv_file, fasta_file)

Check if both files were created.

In [10]:
!tree ../data/acp -DhL 1

../data/acp
├── [9.0K Dec  1  2021]  test_data.csv
├── [ 14K Aug  9 09:44]  test_data.fa
├── [ 36K Dec  1  2021]  train_data.csv
└── [ 58K Aug  9 09:44]  train_data.fa

0 directories, 4 files
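Beyond checking that the files exist, one could also compare the number of FASTA records with the number of CSV rows. The sketch below assumes that the conversion writes exactly one record per CSV row and that the CSV has a single header row; the function names are hypothetical and not part of the conversion script.

```python
import csv


def count_fasta_records(fasta_path):
    """Count records by counting header lines, which start with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def count_csv_rows(csv_path):
    """Count non-empty data rows, excluding the header row."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for row in reader if row)


def records_match_rows(csv_path, fasta_path):
    """Return True when the FASTA file has one record per CSV data row."""
    return count_csv_rows(csv_path) == count_fasta_records(fasta_path)
```

In this notebook it could be called as `records_match_rows(csv_file, fasta_file)` right after a conversion cell.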


## AMP - Antimicrobial Peptides

Antimicrobial peptides have a wide range of inhibitory effects against bacteria, fungi, parasites, and viruses.



We have only one file for this group of peptides.

In [11]:
# Define paths
csv_file = '../data/amp/all_data.csv'
fasta_file = '../data/amp/all_data.fa'

In [12]:
# Create dataframe to see data format
df = pd.read_csv(csv_file)
print(df.shape)
df.head()

(4042, 3)


| | PDBs_code | SequenceID | label |
|---|---|---|---|
| 0 | AP02484 | GMASKAGSVLGKITKIALGAL | 1 |
| 1 | AP02630 | NIGLFTSTCFSSQCFSSKCFTDTCFSSNCFTGRHQCGYTHGSC | 1 |
| 2 | AP01427 | GAIKDALKGAAKTVAVELLKKAQCKLEKTC | 1 |
| 3 | AP02983 | FFGRLKAVFRGARQGWKEHRY | 1 |
| 4 | AP01815 | DFGCARGMIFVCMRRCARMYPGSTGYCQGFRCMCDTMIPIRRPPFIMG | 1 |


Run conversion

In [13]:
# Run the script
cf.csv_to_fasta(csv_file, fasta_file)

Check if the file was created.

In [14]:
!tree ../data/amp -DhL 1

../data/amp
├── [193K Feb  3  2022]  all_data.csv
└── [190K Aug  9 10:03]  all_data.fa

0 directories, 2 files
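To look inside a converted file rather than only confirm that it exists, a small FASTA reader is enough. The following is a minimal sketch; `read_fasta` is a hypothetical helper, and the sample headers assume that the identifier column of the AMP data ends up in the header line, which the notebook does not show.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Sequences that span several lines are joined into one string.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# In-memory sample; the headers are an assumption about the output format
sample = [
    ">AP02484",
    "GMASKAGSVLGKITKIALGAL",
    ">AP02983",
    "FFGRLKAVFRGARQGWKEHRY",
]
for header, sequence in read_fasta(sample):
    print(header, len(sequence))
```

For a real file, an open file handle can be passed in place of the list, for example `read_fasta(open(fasta_file))`.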


## DNA-Binding proteins

DNA-binding proteins play an important role in DNA replication, DNA methylation, gene expression, and other biological processes.


### Train Dataset

In [15]:
# Define paths
csv_file = '../data/dna_binding/train.csv'
fasta_file = '../data/dna_binding/train.fa'

In [16]:
# Create dataframe to see data format
df = pd.read_csv(csv_file)
print(df.shape)
df.head()

(14189, 4)


| | code | sequence | label | origin |
|---|---|---|---|---|
| 0 | Q6A8L0 | MSGHSKWATTKHKKAAIDAKRGKLFARLIKNIEVAARLGGGDPSGN... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |
| 1 | Q7V7T9 | MIGWLQGQKVEAWQQGTRQGVVLACAGVGYEVQIAPRHLSEMEHGQ... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |
| 2 | Q9ZUP2 | MARILRNVYSLRSSLFSSELLRRSVVGTSFQLRGFAAKAKKKSKSD... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |
| 3 | Q2JVG1 | MKCPRCGKQEIRVLESRSAEGGQSVRRRRECMSCGYRFTTYERIEF... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |
| 4 | Q9K4Q3 | MTKADIIEGVYEKVGFSKKESAEIVELVFDTLKETLERGDKIKISG... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |


Run conversion

In [17]:
# Run the script
cf.csv_to_fasta(csv_file, fasta_file)

### Test Dataset

In [18]:
# Define paths
csv_file = '../data/dna_binding/test.csv'
fasta_file = '../data/dna_binding/test.fa'

In [19]:
# Create dataframe to see data format
df = pd.read_csv(csv_file)
print(df.shape)
df.head()

(2272, 4)


| | code | sequence | label | origin |
|---|---|---|---|---|
| 0 | P27204\|1 | AKKRSRSRKRSASRKRSRSRKRSASKKSSKKHVRKALAAGMKNHLL... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |
| 1 | P53528\|1 | MVMVVNPLTAGLDDEQREAVLAPRGPVCVLAGAGTGKTRTITHRIA... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |
| 2 | P52684\|1 | MKDDINQEITFRKLSVFMMFMAKGNIARTAEAMKLSSVSVHRALHT... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |
| 3 | P10961\|1 | MNNAANTGTTNESNVSDAPRIEPLPSLNDDDIEKILQPNDIFTTDR... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |
| 4 | P06023\|1 | MAKPAKRIKSAAAAYVPQNRDAVITDIKRIGDLQREASRLETEMND... | 1 | https://github.com/hfuulgb/PDB-Fusion/tree/mai... |


Run conversion

In [20]:
# Run the script
cf.csv_to_fasta(csv_file, fasta_file)

Check if both files were created.

In [21]:
!tree ../data/dna_binding -DhL 1

../data/dna_binding
├── [1.1M Dec  1  2021]  test.csv
├── [1.0M Aug  9 10:15]  test.fa
├── [6.0M Aug  9 10:13]  train.fa
└── [6.6M Dec  1  2021]  train.csv

0 directories, 4 files
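If the record headers are later used to match model outputs back to labels, it may be worth checking that no header occurs twice in a converted file. The sketch below is one possible check, not part of the conversion script; `duplicate_headers` is a hypothetical helper, and the sample assumes the `code` column is written as the header.

```python
from collections import Counter


def duplicate_headers(lines):
    """Return {header: count} for FASTA headers that occur more than once."""
    counts = Counter(
        line[1:].strip() for line in lines if line.startswith(">")
    )
    return {header: n for header, n in counts.items() if n > 1}


# In-memory sample with one deliberately repeated header
sample_lines = [
    ">P27204|1\n", "AKKRSRSRKRSASRKRSRSRKRSASKKSSKK\n",
    ">P53528|1\n", "MVMVVNPLTAGLDDEQREAVLAPRGPVCVLAG\n",
    ">P27204|1\n", "AKKRSRSRKRSASRKRSRSRKRSASKKSSKK\n",
]
print(duplicate_headers(sample_lines))
```

An empty dictionary means all headers are unique. For a real file, an open handle can be passed, for example `duplicate_headers(open(fasta_file))`.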
