<a href="https://colab.research.google.com/github/tnahddisttud/BioInformaticsProject/blob/main/bioinfProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

>***Bioinformatics Project***
>>Authors: *Megha Patel, Siddhant Pandey*
---
# **Predicting Antimicrobial Peptides**
---

## **Installing Conda**

In [None]:
! wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
! chmod +x Miniconda3-py37_4.8.2-Linux-x86_64.sh
! bash ./Miniconda3-py37_4.8.2-Linux-x86_64.sh -b -f -p /usr/local
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')

## **Installing Pfeature Library**

In [None]:
! wget https://github.com/raghavagps/Pfeature/raw/master/PyLib/Pfeature.zip

In [None]:
! unzip -o Pfeature.zip


In [None]:
%cd Pfeature

In [None]:
! python setup.py install

## **Installing CD-HIT**

In [None]:
! conda install -c bioconda cd-hit -y

## **Load Peptide DataSet**

In [None]:
! wget https://raw.githubusercontent.com/dataprofessor/AMP/main/train_po.fasta

In [None]:
! wget https://raw.githubusercontent.com/dataprofessor/AMP/main/train_ne.fasta

In [None]:
! cat train_ne.fasta

## **Removing Redundant Sequences using CD-HIT**

In [None]:
! cd-hit -i train_po.fasta -o train_po_cdhit.txt -c 0.99

In [None]:
! cd-hit -i train_ne.fasta -o train_ne_cdhit.txt -c 0.99

In [None]:
! ls -l

In [None]:
! grep ">" train_po_cdhit.txt | wc -l

In [None]:
! grep ">" train_po.fasta | wc -l

In [None]:
! grep ">" train_ne.fasta | wc -l

In [None]:
! grep ">" train_ne_cdhit.txt | wc -l
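
The `grep` counts above can also be collected in Python, which makes it easy to report how many sequences CD-HIT removed. This is a minimal sketch, assuming plain FASTA files in which every record starts with a `>` header line. `count_fasta_records` is a hypothetical helper (not part of CD-HIT or Pfeature), and the demo file is made up for illustration.

```python
import os
import tempfile


def count_fasta_records(path):
    """Count records in a FASTA file by counting lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Demo on a tiny made-up FASTA file (not the real training data).
demo_fasta = ">seq1\nGLFDIVKKVV\n>seq2\nKWKLFKKIEK\n>seq3\nGLFDIVKKVV\n"
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(demo_fasta)
    demo_path = tmp.name

n_records = count_fasta_records(demo_path)
os.remove(demo_path)
print(n_records)
```

In the notebook, the same helper could be called on `train_po.fasta` and `train_po_cdhit.txt` to compare the counts before and after clustering.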

# **Calculate features using the Pfeature library**

The feature classes provided by Pfeature are summarized in the table below.

**Composition Based Features**

Feature class | Description | Function
---|---|---
AAC | Amino acid composition | aac_wp
DPC | Dipeptide composition | dpc_wp
TPC | Tripeptide composition | tpc_wp
ABC | Atom and bond composition | atc_wp, btc_wp
PCP | Physico-chemical properties | pcp_wp
AAI | Amino acid index composition | aai_wp
RRI | Repetitive Residue Information | rri_wp
DDR | Distance distribution of residues | ddr_wp
PRI | Physico-chemical properties repeat composition | pri_wp
SEP | Shannon entropy | sep_wp
SER | Shannon entropy of residue level | ser_wp
SPC | Shannon entropy of physicochemical property | spc_wp
ACR | Autocorrelation | acr_wp
CTC | Conjoint Triad Calculation | ctc_wp
CTD | Composition enhanced transition distribution | ctd_wp
PAAC | Pseudo amino acid composition | paac_wp
APAAC | Amphiphilic pseudo amino acid composition | apaac_wp
QSO | Quasi sequence order | qos_wp
SOC | Sequence order coupling | soc_wp

[Pfeature Manual](https://webs.iiitd.edu.in/raghava/pfeature/Pfeature_Manual.pdf)
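
To make the simplest feature class in the table concrete, the sketch below computes an amino acid composition (AAC) by hand as the percentage of each residue in a sequence. It is an illustration only: `amino_acid_composition` is a hypothetical helper, and Pfeature's `aac_wp` may differ in normalisation and output format.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the percentage of each standard amino acid in a sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    return {aa: 100.0 * counts[aa] / len(sequence) for aa in AMINO_ACIDS}


# Made-up four-residue peptide for illustration.
composition = amino_acid_composition("GGAK")
print(composition["G"], composition["A"], composition["K"])
```

Each sequence becomes a fixed-length vector of 20 numbers regardless of its length, which is what lets peptides of different sizes share one feature table.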

# **Now, The Real Work Begins!🥳**
---

## **Introduction to Amino Acids**
---
Classification of Amino acids:


**Side-Chain based differences**

Name | 3-letter Code | 1-letter code | R-group
---|---|---|---
Glycine | Gly | G |![Glycine](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Glycine-2D-skeletal.png/180px-Glycine-2D-skeletal.png)
Alanine | Ala| A | -CH3
Leucine | Leu | L | -CH2CH(CH3)2
Isoleucine | Ile | I | -CH(CH3)CH2CH3
Valine | Val | V | -CH(CH3)2
Phenylalanine | Phe | F | -CH2C6H5
Tyrosine | Tyr | Y | -CH2C6H4OH
Tryptophan | Trp | W | -CH2-(3-indolyl)
Methionine | Met | M | -CH2CH2SCH3
Cysteine | Cys | C | -CH2SH
Aspartic acid | Asp | D | -CH2COOH
Glutamic acid | Glu | E | -CH2CH2COOH
Histidine | His | H | -CH2-(4-imidazolyl)
Lysine | Lys | K | -(CH2)4NH2
Arginine | Arg | R | -(CH2)3NHC(=NH)NH2
Asparagine | Asn | N | -CH2CONH2
Glutamine | Gln | Q | -CH2CH2CONH2
Serine | Ser | S | -CH2OH
Threonine | Thr | T | -CH(OH)CH3
Proline | Pro | P | -(CH2)3- (ring closed onto the backbone N)
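
Peptide sequences in FASTA files use the one-letter codes, so a lookup from three-letter to one-letter codes is often handy. The sketch below uses the standard codes for the 20 amino acids; `THREE_TO_ONE` and `to_one_letter` are hypothetical names introduced here for illustration.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Gly": "G", "Ala": "A", "Leu": "L", "Ile": "I", "Val": "V",
    "Phe": "F", "Tyr": "Y", "Trp": "W", "Met": "M", "Cys": "C",
    "Asp": "D", "Glu": "E", "His": "H", "Lys": "K", "Arg": "R",
    "Asn": "N", "Gln": "Q", "Ser": "S", "Thr": "T", "Pro": "P",
}


def to_one_letter(three_letter_codes):
    """Convert a list of three-letter codes into a one-letter sequence string."""
    return "".join(THREE_TO_ONE[code.capitalize()] for code in three_letter_codes)


# Made-up pentapeptide for illustration.
peptide = to_one_letter(["Gly", "Leu", "Phe", "Asp", "Gln"])
print(peptide)
```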

## **Working on Different Features**

In [None]:
import pandas as pd

In [None]:
# Atom and bond composition (ABC)

from Pfeature.pfeature import atc_wp

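def atc(input_file):
  # Derive the output name, e.g. train_po_cdhit.txt -> train_po_cdhit.atc.csv
  base = input_file.rsplit('.', 1)[0]
  output = base + '.atc.csv'
  # atc_wp writes the atom composition of each sequence to the CSV file
  atc_wp(input_file, output)
  return pd.read_csv(output)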

feature = atc('train_po_cdhit.txt')
feature

### **Calculate the feature for both classes, combine them, and attach class labels**

In [None]:
pos = 'train_po_cdhit.txt'
neg = 'train_ne_cdhit.txt'

def feature_calc(po, ne, feature_name):
  # Calculate the feature for each class
  po_feature = feature_name(po)
  ne_feature = feature_name(ne)
  # Create one class label per sequence
  po_class = pd.Series(['positive'] * len(po_feature))
  ne_class = pd.Series(['negative'] * len(ne_feature))
  # Stack positives on top of negatives; ignore_index gives a unique row index
  po_ne_class = pd.concat([po_class, ne_class], axis=0, ignore_index=True)
  po_ne_class.name = 'class'
  po_ne_feature = pd.concat([po_feature, ne_feature], axis=0, ignore_index=True)
  # Attach the class labels as the last column
  df = pd.concat([po_ne_feature, po_ne_class], axis=1)
  return df

feature = feature_calc(pos, neg, atc) # ATC
print(feature)
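
The stacking and labelling performed by `feature_calc` can be checked on a toy example. The sketch below uses two made-up feature tables in place of the Pfeature output; the column names `f1` and `f2` are placeholders. It resets the row index while stacking so that features and labels line up row by row.

```python
import pandas as pd

# Made-up feature tables standing in for the positive and negative classes.
po_feature = pd.DataFrame({"f1": [0.1, 0.2], "f2": [1.0, 2.0]})
ne_feature = pd.DataFrame({"f1": [0.3], "f2": [3.0]})

# One label per row of each class, positives first.
labels = pd.Series(
    ["positive"] * len(po_feature) + ["negative"] * len(ne_feature),
    name="class",
)

# ignore_index=True renumbers the rows 0..n-1 after stacking.
features = pd.concat([po_feature, ne_feature], axis=0, ignore_index=True)

# Column-wise concat aligns on the row index, so both sides must share it.
combined = pd.concat([features, labels], axis=1)
print(combined)
```

Without a unique row index, the column-wise `pd.concat` only works when both inputs happen to carry identical duplicated indices, which is easy to break by accident.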

**Data pre-processing**

In [None]:
# Assign the features to X and the class labels to y
X = feature.drop('class', axis=1)
y = feature['class'].copy()

In [None]:
# Encode the class labels as integers (positive = 1, negative = 0)
y = y.map({"positive": 1, "negative": 0})

In [None]:
X.shape

In [None]:
# Feature selection: keep only features whose variance exceeds 0.1
from sklearn.feature_selection import VarianceThreshold

fs = VarianceThreshold(threshold=0.1)
fs.fit(X)
# Select the retained columns from X so their names are preserved
X2 = X.loc[:, fs.get_support()]
X2
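
To see what the variance filter does, the sketch below applies the same threshold to a made-up two-column table: one column barely changes across rows, the other varies widely. The column names and values are invented for illustration.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Made-up data: one nearly constant column and one that varies.
toy = pd.DataFrame({
    "nearly_constant": [1.0, 1.0, 1.0, 1.1],
    "informative": [0.0, 1.0, 2.0, 3.0],
})

selector = VarianceThreshold(threshold=0.1)
selector.fit(toy)

# get_support() returns a boolean mask over the columns that pass the threshold.
kept = toy.loc[:, selector.get_support()]
kept_columns = list(kept.columns)
print(kept_columns)
```

Indexing the original DataFrame with the mask, rather than using the array returned by `fit_transform`, keeps the column names available for later inspection.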

In [None]:
# Data split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size=0.2, random_state=42, stratify=y)
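
The `stratify=y` argument keeps the class ratio the same in the training and test sets, which matters when the classes are imbalanced. A minimal sketch on made-up labels (25% positive) shows the effect; `X_toy` and `y_toy` are placeholders, not the peptide data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 5 positive (1) and 15 negative (0).
X_toy = [[i] for i in range(20)]
y_toy = [1] * 5 + [0] * 15

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both subsets should keep the 1:3 ratio of positives to negatives.
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```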

# **Quickly compare >30 ML algorithms**

In [None]:
! pip install lazypredict

In [None]:
# Import libraries
import lazypredict
from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef

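# Reload the full feature table (no variance filtering; class labels stay as strings)
X = feature.drop('class', axis=1)
y = feature['class'].copy()

# Data split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define and fit the LazyClassifier. Passing the training set as both the
# fitting and the evaluation data reports training-set performance.
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=matthews_corrcoef)
models_train, predictions_train = clf.fit(X_train, X_train, y_train, y_train)
# For held-out performance, evaluate on the test set instead:
#models_test, predictions_test = clf.fit(X_train, X_test, y_train, y_test)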

In [None]:
# Prints the model performance (Training set)
models_train
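
The custom metric passed to `LazyClassifier` is the Matthews correlation coefficient (MCC), which summarises all four confusion-matrix counts in one value between -1 and 1. The sketch below computes it by hand for binary labels to show what the number means; `mcc_from_labels` is a hypothetical helper written for illustration, and the labels are made up.

```python
import math


def mcc_from_labels(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the score is 0 when any marginal count is zero.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Made-up example: one positive is missed, everything else is correct.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
score = mcc_from_labels(y_true, y_pred)
print(round(score, 3))
```

A score of 1 means perfect agreement, 0 is no better than chance, and -1 means every prediction is wrong. Because the models above are scored on the data they were trained on, their MCC values are likely to be optimistic compared with held-out performance.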