<a href="https://colab.research.google.com/github/quickbrainlab/Project_2_Protein_Sequence_Classifier_with_ML/blob/main/Project_2_Protein_Sequence_Classifier_with_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Step 1: Create a CSV File**
Open *Notepad* or *Google Sheets*, and paste the data in this format:


Sequence,Label
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMF,Non-Enzyme
MALWMRLLPLLALLALWGPDPAAA,Hormone
MKAKLLLAVVTLATARLSYVNQ,Enzyme
MKWVTFISLLFLFSSAYS,Enzyme
MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE,Non-Enzyme


👉 Save this as protein_data.csv.
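
If you would rather create the file from code than from an editor, the same five rows can be written with Python's standard `csv` module. This is a minimal sketch that assumes the file should land in the current working directory; the `rows` list is simply the sample data shown above.

```python
import csv

# The five example rows from Step 1: (sequence, label)
rows = [
    ("MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMF", "Non-Enzyme"),
    ("MALWMRLLPLLALLALWGPDPAAA", "Hormone"),
    ("MKAKLLLAVVTLATARLSYVNQ", "Enzyme"),
    ("MKWVTFISLLFLFSSAYS", "Enzyme"),
    ("MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE", "Non-Enzyme"),
]

# newline="" lets the csv module control line endings on every platform
with open("protein_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Sequence", "Label"])  # header row expected by Step 2
    writer.writerows(rows)
```

In Colab the file is written to the session's working directory, so `pd.read_csv("protein_data.csv")` in Step 2 can find it.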


**Step 2: Load the Dataset**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix


df = pd.read_csv("protein_data.csv")
print(df.head())

                                   Sequence       Label
0        MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMF  Non-Enzyme
1                  MALWMRLLPLLALLALWGPDPAAA     Hormone
2                    MKAKLLLAVVTLATARLSYVNQ      Enzyme
3                        MKWVTFISLLFLFSSAYS      Enzyme
4  MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE  Non-Enzyme


**Step 3: Preprocess the Data**
Convert amino acid sequences into numerical features.
You can start by encoding amino acids using frequency or one-hot encoding. Here’s a simple way to convert sequences into amino acid frequencies:

In [None]:
from collections import Counter
import pandas as pd

# Load the CSV file
df = pd.read_csv("protein_data.csv")

# Convert a protein sequence into the relative frequency of each of the 20 standard amino acids
def aa_frequency(sequence):
    sequence = str(sequence).upper()  # normalize case so lowercase input is counted too
    count = Counter(sequence)
    total = len(sequence)
    return {aa: count.get(aa, 0) / total for aa in 'ACDEFGHIKLMNPQRSTVWY'}

# Apply the function to each sequence
features_df = df['Sequence'].apply(aa_frequency).apply(pd.Series)

# Combine features with label column
final_df = pd.concat([features_df, df['Label']], axis=1)

# Show final processed data
print(final_df.head())

          A    C         D         E         F         G         H         I  \
0  0.205882  0.0  0.029412  0.088235  0.029412  0.117647  0.029412  0.000000   
1  0.250000  0.0  0.041667  0.000000  0.000000  0.041667  0.000000  0.000000   
2  0.181818  0.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
3  0.055556  0.0  0.000000  0.000000  0.166667  0.000000  0.000000  0.055556   
4  0.050000  0.0  0.050000  0.050000  0.125000  0.050000  0.050000  0.050000   

          K         L  ...         N         P         Q         R         S  \
0  0.088235  0.058824  ...  0.029412  0.029412  0.000000  0.029412  0.029412   
1  0.000000  0.333333  ...  0.000000  0.125000  0.000000  0.041667  0.000000   
2  0.090909  0.227273  ...  0.045455  0.000000  0.045455  0.045455  0.045455   
3  0.055556  0.166667  ...  0.000000  0.000000  0.000000  0.000000  0.222222   
4  0.075000  0.100000  ...  0.000000  0.000000  0.000000  0.100000  0.125000   

          T         V         W       
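
Step 3 also mentions one-hot encoding as an alternative to frequencies. The sketch below shows one possible way to do it; the helper name `one_hot_encode` and the `max_len` parameter are illustrative assumptions, not part of the original notebook. Because sequences differ in length, each one is padded with zeros (or truncated) to `max_len` positions, and non-standard residues are left as all-zero rows.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence, max_len):
    """Return a flat vector of length max_len * 20 for one sequence (hypothetical helper)."""
    encoded = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    # Truncate to max_len; shorter sequences keep trailing all-zero rows as padding
    for pos, aa in enumerate(str(sequence).upper()[:max_len]):
        idx = AA_INDEX.get(aa)
        if idx is not None:  # skip anything outside the 20 standard amino acids
            encoded[pos, idx] = 1.0
    return encoded.ravel()

# max_len=40 is the length of the longest sequence in the sample CSV
vec = one_hot_encode("MKWVTFISLLFLFSSAYS", max_len=40)
print(vec.shape)  # (800,)
```

Unlike frequencies, this encoding keeps residue order, but it produces far more features (800 here versus 20), which is a lot for a dataset of five sequences.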

**Step 4: Train-Test Split and Model Training**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Split data into features and labels
X = final_df.drop('Label', axis=1)
y = final_df['Label']

# Split into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

In [None]:
print("Model has been trained successfully.")

Model has been trained successfully.


**Step 5: Evaluate the Model**

In [None]:
# Accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Classification Report
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)

Model Accuracy: 0.0
Classification Report:
               precision    recall  f1-score   support

     Hormone       0.00      0.00      0.00       1.0
  Non-Enzyme       0.00      0.00      0.00       0.0

    accuracy                           0.00       1.0
   macro avg       0.00      0.00      0.00       1.0
weighted avg       0.00      0.00      0.00       1.0
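
An accuracy of 0.0 here most likely reflects the dataset size rather than a coding error. With five rows and `test_size=0.2`, the test set holds a single sequence. The report lists support 1 for Hormone and 0 for Non-Enzyme, which suggests that the held-out sequence was the only Hormone example, so the model never saw that class during training and predicted Non-Enzyme instead. A confusion matrix makes this easier to see. The sketch below rebuilds that one-sample case by hand (the `y_true_example` and `y_pred_example` lists are inferred from the report, not taken from the notebook's variables):

```python
from sklearn.metrics import confusion_matrix

# All classes that appear in the CSV, in a fixed order
labels = ["Enzyme", "Hormone", "Non-Enzyme"]

# Single test sample, inferred from the classification report above
y_true_example = ["Hormone"]
y_pred_example = ["Non-Enzyme"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_example, y_pred_example, labels=labels)
print(cm)
# [[0 0 0]
#  [0 0 1]
#  [0 0 0]]
```

In the notebook itself, the equivalent call would be `confusion_matrix(y_test, y_pred, labels=labels)`. A meaningful evaluation needs many more labeled sequences per class.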





**Bonus (Optional): Predict on a New Sequence**
To classify a new protein sequence:

In [None]:
# Example protein sequence
new_seq = "MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGQK"  # Just an example

# Convert to amino acid frequency
new_features = pd.DataFrame([aa_frequency(new_seq)])

# Predict using trained model
prediction = model.predict(new_features)
print("Predicted class:", prediction[0])

Predicted class: Non-Enzyme
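
A single predicted label hides how sure the model is. `RandomForestClassifier.predict_proba` returns one probability per class, which can be more informative on a dataset this small. The sketch below is self-contained so it can run on its own: it rebuilds the features from the five sample sequences and, as a simplifying assumption, trains on all five rows instead of using the split from Step 4.

```python
from collections import Counter

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_frequency(sequence):
    """Relative frequency of each standard amino acid in one sequence."""
    sequence = str(sequence).upper()
    count = Counter(sequence)
    total = len(sequence)
    return {aa: count.get(aa, 0) / total for aa in AMINO_ACIDS}

# Sample data from Step 1
sequences = [
    "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMF",
    "MALWMRLLPLLALLALWGPDPAAA",
    "MKAKLLLAVVTLATARLSYVNQ",
    "MKWVTFISLLFLFSSAYS",
    "MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLGE",
]
labels = ["Non-Enzyme", "Hormone", "Enzyme", "Enzyme", "Non-Enzyme"]

# Assumption: train on every row, since this demo only illustrates predict_proba
X = pd.DataFrame([aa_frequency(s) for s in sequences])
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, labels)

new_seq = "MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGQK"
new_features = pd.DataFrame([aa_frequency(new_seq)])

# One probability per class, in the order given by model.classes_
proba = model.predict_proba(new_features)[0]
for label, p in zip(model.classes_, proba):
    print(f"{label}: {p:.2f}")
```

Probabilities that are close to each other are a hint that the prediction should not be trusted much.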


**Bonus Tip: Save the Trained Model**
Save the trained model if you want to reuse it later:

In [None]:
import joblib
joblib.dump(model, "protein_classifier.pkl")

['protein_classifier.pkl']
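
To use the saved model in a later session, load it back with `joblib.load`. The sketch below round-trips a small stand-in model through a temporary directory so it runs on its own; the `X_demo` and `y_demo` values are placeholders for illustration, not real protein features.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder training data standing in for the real feature table
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [0.1, 0.9], [0.9, 0.1]])
y_demo = ["Enzyme", "Non-Enzyme", "Enzyme", "Non-Enzyme"]
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save to a temporary location, then load it back
path = os.path.join(tempfile.mkdtemp(), "protein_classifier.pkl")
joblib.dump(model, path)
loaded_model = joblib.load(path)

# The reloaded model should give the same predictions as the original
same = bool((loaded_model.predict(X_demo) == model.predict(X_demo)).all())
print(same)
```

In the notebook, `joblib.load("protein_classifier.pkl")` would return the classifier trained in Step 4. Only load pickle files you trust, and ideally use the same scikit-learn version that saved them.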