# Imports

In [14]:
import csv
from utils import read_txt_file, write_csv_file
import pandas as pd

# Materials and Methods

## Datasets

In this implementation, **RB198** was used as the training set and **RB111** as the independent test set.

Both datasets were downloaded from: http://ailab-projects2.ist.psu.edu/RNABindRPlus/data.html

Originally, each dataset was a text file with three lines per record, in the following format:

```
#First line: PDBID and Chain ID
#Second line: Sequence
#Third line: Interface residues defined using a 5.0 angstrom distance cut-off
2XFZ_Y
KGFKDYGHDYHPAPKTENIKGLGDLKPGIPKTPKQNGGGKRKRWTGDKGRKIYEWDSQAGELEGYRASDGQHLGSFDPKTGNQLKGPDPKRNIKKYL
0000000011100000000000000000111011111111110000010110000111100000010110000000000000000000111101111
```

We needed to convert this data into a CSV file with the following columns:

**PDBID**, **ChainID**, **Sequence**, **Interface**

The following cell performs this conversion for both datasets, using the `read_txt_file` and `write_csv_file` helpers imported from `utils`.

In [13]:
rb198txt = 'Datasets/RB198.txt'
rb198 = 'Datasets/RB198.csv'

data = read_txt_file(rb198txt)
write_csv_file(data, rb198)

rb111txt = 'Datasets/RB111.txt'
rb111 = 'Datasets/RB111.csv'
data = read_txt_file(rb111txt)
write_csv_file(data, rb111)
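
The source of `read_txt_file` and `write_csv_file` is not shown in this notebook. As a rough illustration of the conversion they perform, here is a minimal sketch that parses the three-line record format into the four columns listed above. The function names `parse_interface_records` and `write_records` are hypothetical, and the handling of comment lines and blank lines is an assumption about the input files, not a description of the actual `utils` module.

```python
import csv
from typing import Dict, List

FIELDS = ["PDBID", "ChainID", "Sequence", "Interface"]


def parse_interface_records(lines: List[str]) -> List[Dict[str, str]]:
    """Parse (header, sequence, labels) triplets into dictionaries.

    Assumption: lines starting with '#' are comments and blank lines
    carry no information, so both are skipped.
    """
    stripped = (line.strip() for line in lines)
    rows = [line for line in stripped if line and not line.startswith("#")]
    if len(rows) % 3 != 0:
        raise ValueError("expected records of exactly three lines each")

    records = []
    for i in range(0, len(rows), 3):
        header, sequence, interface = rows[i:i + 3]
        # A header such as '2XFZ_Y' splits into PDB ID '2XFZ' and chain 'Y'.
        pdb_id, _, chain_id = header.partition("_")
        if len(sequence) != len(interface):
            raise ValueError(f"sequence/label length mismatch in {header}")
        records.append({
            "PDBID": pdb_id,
            "ChainID": chain_id,
            "Sequence": sequence,
            "Interface": interface,
        })
    return records


def write_records(records: List[Dict[str, str]], path: str) -> None:
    """Write the parsed records to a CSV file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)
```

Checking that each sequence and its label string have the same length catches misaligned records early, before they reach the training step.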

The first three rows of the resulting dataframe look like this:

In [15]:
train_data = pd.read_csv('Datasets/RB198.csv')
train_data.head(3)

Unnamed: 0,PDBID,ChainID,Sequence,Interface
0,2AZ0,A,MPSKLALIQELPDRIQTAVEAAMGMSYQDAPNNVRRDLDNLHACLN...,0000000000000000000000000000000010011001001100...
1,1M8V,A,GAMAERPLDVIHRSLDKDVLVILKKGFEFRGRLIGYDIHLNVVLAD...,0001110110010010000000000000000001111110000000...
2,2PJP,A,FSEEQQAIWQKAEPLFGDEPWWVRDLAKETGTDEQAMRLTLRQAAQ...,0000000000000000000001110000000001000100000000...


## Methodology

In this implementation, the proposed PRIP method comprises five steps:

1. Pre-training the Word2Vec model.
2. Dividing protein sequences.
3. Extracting semantic features.
4. Training the XGBoost classifier.
5. Distinguishing between binding and non-binding sites.
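
To make step 2 more concrete, the sketch below shows one common way to divide a sequence into per-residue samples: a fixed-length window centered on each residue, paired with that residue's interface label. The notebook does not state the window length or the padding scheme, so `window_size=5`, the `-` padding token, and the function name `residue_windows` are all assumptions made for illustration only.

```python
from typing import List, Tuple


def residue_windows(sequence: str,
                    interface: str,
                    window_size: int = 5,   # assumed value, not from the source
                    pad_token: str = "-"    # assumed padding symbol
                    ) -> List[Tuple[str, int]]:
    """Return one (window, label) pair per residue.

    Each window is centered on a residue; positions that fall outside
    the sequence are filled with pad_token.
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so a center residue exists")
    if len(sequence) != len(interface):
        raise ValueError("sequence and interface must have the same length")

    half = window_size // 2
    padded = pad_token * half + sequence + pad_token * half
    return [
        (padded[i:i + window_size], int(interface[i]))
        for i in range(len(sequence))
    ]


# Usage: a 4-residue toy sequence with a window of 3
for window, label in residue_windows("MKVA", "0100", window_size=3):
    print(window, label)
```

Each pair could then serve as one training example for the classifier, with the label marking whether the center residue is a binding site.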


## Word2Vec

In [None]:
import pandas as pd
from gensim.models import Word2Vec

# Load the RB198 training set
rb198_path = 'your_path_here/RB198.csv'  # Replace with your actual path
rb198_data = pd.read_csv(rb198_path)

# Tokenize the sequences into "words" (here, each amino acid is treated as one word)
tokenized_sequences = [list(sequence) for sequence in rb198_data['Sequence']]

# Define the Word2vec model with the specified parameters
model = Word2Vec(sentences=tokenized_sequences,
                vector_size=25,           # Dimensionality of the word vectors
                window=5,                 # Maximum distance between the current and predicted word within a sentence
                min_count=1,              # Ignores all words with total frequency lower than this
                sg=0,                     # Use CBOW model
                negative=5,               # Number of negative samples
                epochs=200,               # Number of iterations (epochs) over the corpus
                workers=1)                # Number of worker threads

# Save the model for later use or to load it in another environment
model.save('/mnt/data/word2vec_protein_sequences.model')

# Example usage: Getting the vector for a specific amino acid
vector_for_amino_acid_A = model.wv['A']
print(vector_for_amino_acid_A)
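
Once per-residue vectors are available, a short run of residues can be turned into a single feature vector by concatenating the vector of each residue. The sketch below illustrates that idea with a plain dictionary standing in for the trained embeddings. The function name `window_features` and the choice to map unknown tokens (such as a padding symbol) to zeros are assumptions; the notebook does not specify how features are assembled. With a trained gensim model, `model.wv` could be passed as `embeddings`, since it supports both `in` checks and `[]` lookups.

```python
from typing import List, Mapping, Sequence


def window_features(window: str,
                    embeddings: Mapping[str, Sequence[float]],
                    dim: int) -> List[float]:
    """Concatenate the embedding of every residue in the window.

    Tokens without an embedding contribute a zero vector of length dim
    (an assumption made for this sketch).
    """
    features: List[float] = []
    for residue in window:
        if residue in embeddings:
            features.extend(float(x) for x in embeddings[residue])
        else:
            features.extend([0.0] * dim)
    return features


# Usage with toy 2-dimensional embeddings (illustrative values only)
toy_embeddings = {"A": [1.0, 2.0], "K": [3.0, 4.0]}
print(window_features("A-K", toy_embeddings, dim=2))
```

The resulting vector has length `len(window) * dim`, so with 25-dimensional embeddings a 5-residue run would yield 125 features.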
