# Example using PDB Autofill

## Overall Function

Provide the directory containing the PDB files, the list of protein names, and the position in that list of the protein you want to classify. The output reports whether that protein tends to have missing residues and lists the features used most often at the nodes the input passed through. A higher node count means the feature was consulted more often, which suggests a stronger influence on whether the protein has missing residues.

In [1]:
import pdb_autofill as pdb
import warnings
warnings.filterwarnings("ignore")


In [2]:
protein_list = ['1gey', '1gzc', '1ufo', '2hi2', '2jfk', '2o73', '2qqt', '2x82', '2y39', '2z91', '3d5m', 
                '3gem', '3qun', '3ueo', '4bal', '4msw', '4wji', '4y79', '5uez', '7fdr']
pdb.pdb_autofill('small_data/', protein_list, 5)

Does this circumstance tend to have missing residues? [ True]
Most used features of nodes that input data went through [('Nonpolar Side Chains', 1022), ('b_factor_gt50', 859), ('Sequence Length', 796), ('b_factor_max', 784), ('resolution', 722), ('Electrically Charged', 706), ('Hydrophobic', 583), ('Special', 486)]


## Loading Sample Dataset

If you are unfamiliar with PDB files, a good place to start is the subset we have prepared for you. Import `datashare` and call `datashare.getdata.get_samples()` to load it.

In [3]:
import datashare

In [None]:
datashare.getdata.get_samples()

## Extract DataFrames

This example shows how you can extract amino acid sequences and some additional features from PDB files using our package.

In [4]:
import dataprocess.PDB_Data_Processing as pdb_dp

In [5]:
residues, headers = pdb_dp.extraction_residues_headers('small_data/', protein_list)

In [6]:
residues.head()  # residues is a DataFrame with one column per PDB file; each column holds that protein's amino acid sequence

Unnamed: 0,1gey,1gzc,1ufo,2hi2,2jfk,2o73,2qqt,2x82,2y39,2z91,3d5m,3gem,3qun,3ueo,4bal,4msw,4wji,4y79,5uez,7fdr
0,THR,VAL,ARG,THR,GLN,ASP,SER,PRO,GLY,GLN,SER,SER,SER,GLY,ALA,GLN,ALA,ILE,SER,ALA
1,ILE,GLU,VAL,LEU,SER,ILE,TRP,VAL,ASP,LEU,MET,ALA,GLY,LEU,THR,VAL,GLN,VAL,HIS,PHE
2,THR,THR,ARG,ILE,MET,ASN,GLU,GLN,LEU,LEU,SER,PRO,LEU,PHE,PHE,GLN,GLN,GLY,MET,VAL
3,ASP,ILE,THR,GLU,ARG,VAL,VAL,HIS,HIS,GLU,TYR,ILE,VAL,SER,GLU,LEU,PHE,GLY,GLU,VAL
4,LEU,SER,GLU,LEU,LEU,VAL,GLY,VAL,GLU,SER,THR,LEU,PRO,GLN,ILE,GLN,GLN,GLN,GLN,THR
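
To make the table above less of a black box, here is a minimal sketch of how a residue sequence can be read from the fixed-width `ATOM` records of a PDB file, and how missing residues can be detected from `REMARK 465` records. This is not the implementation of `extraction_residues_headers`; the names `make_atom_line`, `residue_sequence`, `has_missing_residues`, and `sample_pdb` are hypothetical and exist only for this illustration.

```python
def make_atom_line(serial, atom, res_name, chain, res_seq, b_factor=20.0):
    """Build a fixed-width ATOM record (hypothetical helper, test data only)."""
    return "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, atom, res_name, chain, res_seq, 0.0, 0.0, 0.0, 1.0, b_factor)


def residue_sequence(pdb_text):
    """Return three-letter residue names, one per residue, in file order."""
    sequence = []
    last_key = None
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        res_name = line[17:20].strip()      # columns 18-20: residue name
        res_key = (line[21], line[22:27])   # chain ID + residue number/insertion code
        if res_key != last_key:             # first atom of a new residue
            sequence.append(res_name)
            last_key = res_key
    return sequence


def has_missing_residues(pdb_text):
    """True if the file contains REMARK 465 records (unobserved residues)."""
    return any(line.startswith("REMARK 465") for line in pdb_text.splitlines())


# Tiny made-up structure: two atoms of THR 1 and one atom of ILE 2 on chain A.
sample_pdb = "\n".join([
    make_atom_line(1, "N", "THR", "A", 1),
    make_atom_line(2, "CA", "THR", "A", 1),
    make_atom_line(3, "N", "ILE", "A", 2),
])

print(residue_sequence(sample_pdb))
print(has_missing_residues(sample_pdb))
```

The sketch keeps one entry per residue by watching for a change in chain and residue number, so multi-atom residues are not counted twice.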


In [7]:
features = pdb_dp.extracted_features('small_data/', protein_list)

In [8]:
features['Protein'] = features['protein']
features.to_csv('RF_input.csv')
features.head()

Unnamed: 0,protein,name,head,structure_method,resolution,has_missing_residues,b_factor_avg,b_factor_med,b_factor_max,b_factor_gt50,Sequence Length,Electrically Charged,Nonpolar Side Chains,Hydrophobic,Special,Protein
0,1gey,crystal structure of histidinol-phosphate ami...,transferase,x-ray diffraction,2.3,True,23.901636,20.17,87.44,190,335,0.21194,0.2,0.453731,0.107463,1gey
1,1gzc,high-resolution crystal structure of erythrin...,sugar binding protein,x-ray diffraction,1.58,False,27.355023,24.68,67.17,31,239,0.175732,0.280335,0.401674,0.142259,1gzc
2,1ufo,crystal structure of tt1662 from thermus ther...,hydrolase,x-ray diffraction,1.6,False,18.072573,14.57,69.78,49,1404,0.277778,0.0769231,0.448718,0.196581,1ufo
3,2hi2,crystal structure of native neisseria gonorrh...,cell adhesion,x-ray diffraction,2.3,False,42.820564,41.33,131.14,368,157,0.261146,0.216561,0.407643,0.101911,2hi2
4,2jfk,structure of the mat domain of human fas with...,transferase,x-ray diffraction,2.4,True,31.797363,31.36,87.42,424,1616,0.230198,0.169554,0.430074,0.150371,2jfk
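
The `b_factor_*` columns above summarize the temperature factors stored in each PDB file. As a rough sketch, and assuming the columns are plain statistics over `ATOM` records with `b_factor_gt50` counting atoms whose B-factor exceeds 50 (an assumption, not confirmed by the package), they could be computed as follows. The function `b_factor_features` and the helper `fake_atom` are hypothetical.

```python
import statistics


def fake_atom(b_factor):
    """Stripped-down ATOM line: only the record name and B-factor field are set."""
    return "ATOM".ljust(60) + "%6.2f" % b_factor


def b_factor_features(pdb_text, threshold=50.0):
    """Summarize B-factors (columns 61-66) over all ATOM records."""
    values = [float(line[60:66])
              for line in pdb_text.splitlines()
              if line.startswith("ATOM")]
    if not values:
        raise ValueError("no ATOM records found")
    return {
        "b_factor_avg": statistics.mean(values),
        "b_factor_med": statistics.median(values),
        "b_factor_max": max(values),
        # Assumed meaning: number of atoms with a B-factor above the threshold.
        "b_factor_gt50": sum(1 for v in values if v > threshold),
    }


# Three made-up atoms, one of them highly mobile.
sample = "\n".join(fake_atom(b) for b in (10.0, 20.0, 60.0))
stats = b_factor_features(sample)
print(stats)
```

High B-factors indicate atoms whose positions are poorly defined, which is why these statistics are plausible predictors of missing residues.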


## Classify Missing Densities

This example shows how to use the classification feature. The input is the CSV of extracted features written in the previous step, and the output is a ranking of the features that best explain the prediction. A higher node count indicates a more significant impact on missing residues.

In [9]:
import randomforest.Random_Forest as RF

In [10]:
RF_input = 'RF_input.csv'

In [11]:
RF.get_reason_prediction(RF_input, 5)

Does this circumstance tend to have missing residues? [ True]
Most used features of nodes that input data went through [('Nonpolar Side Chains', 1022), ('b_factor_gt50', 859), ('Sequence Length', 796), ('b_factor_max', 784), ('resolution', 722), ('Electrically Charged', 706), ('Hydrophobic', 583), ('Special', 486)]


array([ True])
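
The node counts in the output above can be read as "how many split nodes testing this feature did the input pass through across the forest". The sketch below illustrates that idea on a hand-written toy forest; it is not the code behind `get_reason_prediction`, the trees and thresholds are invented rather than trained, and the names `walk`, `explain`, and `toy_forest` are hypothetical. The sample values are taken from the `1gey` row shown earlier.

```python
from collections import Counter

# A node is either a leaf {"label": bool} or a split
# {"feature": str, "threshold": float, "left": node, "right": node}.
toy_forest = [
    {"feature": "resolution", "threshold": 2.0,
     "left": {"label": False},
     "right": {"feature": "b_factor_gt50", "threshold": 100,
               "left": {"label": False}, "right": {"label": True}}},
    {"feature": "b_factor_gt50", "threshold": 150,
     "left": {"label": False}, "right": {"label": True}},
    {"feature": "Sequence Length", "threshold": 300,
     "left": {"label": False},
     "right": {"feature": "resolution", "threshold": 2.5,
               "left": {"label": True}, "right": {"label": True}}},
]


def walk(tree, sample, counts):
    """Follow one tree to a leaf, counting the feature tested at each split."""
    node = tree
    while "label" not in node:
        feature = node["feature"]
        counts[feature] += 1
        go_left = sample[feature] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


def explain(forest, sample):
    """Return the majority-vote prediction and per-feature node counts."""
    counts = Counter()
    votes = [walk(tree, sample, counts) for tree in forest]
    prediction = sum(votes) > len(votes) / 2
    return prediction, counts


sample_1gey = {"resolution": 2.3, "b_factor_gt50": 190, "Sequence Length": 335}
prediction, counts = explain(toy_forest, sample_1gey)
print(prediction, counts.most_common())
```

A feature that appears in many splits along the input's path accumulates a high count, which is the sense in which the ranking points to likely reasons for missing residues.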

## Predict Missing Densities

We did not have time to finish this section. It would have demonstrated how to predict missing densities with our package.

In [12]:
import neural_networks.nnm_coordinate as nnm

Using TensorFlow backend.
