Automated-confirmation-of-protein-function-annotation-using-NLP

The logistic regression model, the support vector machine model (linear kernel), and the RCNN model (with LSTM units) are in the Models directory.

The RCNN code is modified from https://github.com/roomylee/rcnn-text-classification

Pretrained word embedding and sentence embedding models: https://github.com/ncbi-nlp/BioSentVec

TensorFlow:

pip install tensorflow==1.4

NumPy version 1.15.2

More info on environment setup:

git clone https://github.com/epfml/sent2vec.git 
pip install Cython

  *   Download the pretrained BioSentVec model: BioSentVec_PubMed_MIMICIII-bigram_d700.bin
  *   Install fastText as shown below, then install sent2vec with pip from inside the cloned sent2vec directory:
  
$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
$ unzip v0.9.2.zip
$ cd fastText-0.9.2
$ make

(pass_venv) jin@pass:/opt/pass/machine-learning-deploy$ ls
__pycache__      ensemble-eval.py  fastText-0.9.2  unit-classification.py
data_helpers.py  eval.py           sent2vec        v0.9.2.zip
(pass_venv) jin@pass:/opt/pass/machine-learning-deploy$ cd sent2vec/
(pass_venv) jin@pass:/opt/pass/machine-learning-deploy/sent2vec$ pip install .

If a later version of TensorFlow is used (for example 2.x), try importing the TF1 compatibility module instead:

import tensorflow.compat.v1 as tf

Model Training:

./cmd-lstm-ns

Alternatively, run the command below:

python train-ns-title.py --cell_type "lstm" --pos_dir "data/title.pos-ns" --neg_dir "data/title.neg-ns" --word2vec "/home/paperspace/Documents/BioSentVec/BioWordVec_PubMed_MIMICIII_d200.vec.bin" --word_embedding_dim 200 --context_embedding_dim 150 --hidden_size 150

"lstm" selects the LSTM unit as the recurrent cell. "ns" stands for non-stemmed words, so "pos-ns" is the positive non-stemmed data and "neg-ns" is the negative non-stemmed data. Change the --word2vec path to the actual location of the pretrained "BioWordVec_PubMed_MIMICIII_d200.vec.bin" file on your machine.
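Since only the --word2vec path differs between machines, the training command can be assembled from a single configurable value. The sketch below is an illustration, not part of the repository: the function name build_train_command and the environment variable BIOWORDVEC_PATH are hypothetical, and the remaining flags are copied from the command above.

```python
import os
import shlex


def build_train_command(word2vec_path, cell_type="lstm"):
    """Return the README's training command as an argument list."""
    return [
        "python", "train-ns-title.py",
        "--cell_type", cell_type,
        "--pos_dir", "data/title.pos-ns",
        "--neg_dir", "data/title.neg-ns",
        "--word2vec", word2vec_path,
        "--word_embedding_dim", "200",
        "--context_embedding_dim", "150",
        "--hidden_size", "150",
    ]


# BIOWORDVEC_PATH is a hypothetical variable; the repository does not read it.
path = os.environ.get("BIOWORDVEC_PATH", "BioWordVec_PubMed_MIMICIII_d200.vec.bin")
cmd = build_train_command(path)

# Print a copy-pasteable command line with the path safely quoted.
print(" ".join(shlex.quote(part) for part in cmd))
```

To launch training from Python rather than printing the command, the same list can be passed to subprocess.run(cmd, check=True).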

Model Evaluation:

./runevalemsemble

Alternatively, run the command below:

python ensemble-eval.py --pos_dir "data/title.pos-ns" --neg_dir "data/title.neg-ns" --batch_size 32 --checkpoint_dir "/home/paperspace/Documents/RCNN-421-BioSentVec/runs-ns/1587761106/checkpoints"

Change the --checkpoint_dir path in the command to your own checkpoint directory. Also, inside ensemble-eval.py, change the paths to the SVM and logistic regression models (both are available in the Models directory).
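The checkpoint path in the example ends in runs-ns/1587761106/checkpoints, where the numeric folder looks like a Unix timestamp. Assuming every training run is stored this way (an assumption drawn from that one path, not confirmed elsewhere in this README), a small helper can locate the newest run. The name latest_checkpoint_dir is hypothetical.

```python
import os


def latest_checkpoint_dir(runs_root):
    """Return the checkpoints directory of the newest run, or None if there is none."""
    # Assumption: each run lives in <runs_root>/<unix timestamp>/checkpoints.
    run_ids = [
        name
        for name in os.listdir(runs_root)
        if name.isdigit()
        and os.path.isdir(os.path.join(runs_root, name, "checkpoints"))
    ]
    if not run_ids:
        return None
    newest = max(run_ids, key=int)  # largest timestamp is the most recent run
    return os.path.join(runs_root, newest, "checkpoints")


if __name__ == "__main__":
    root = "runs-ns"
    if os.path.isdir(root):
        print(latest_checkpoint_dir(root))
```

The returned path can then be passed as the --checkpoint_dir argument.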

Unit tests:

Run ./unit-test

Alternatively, run:

python unit-classification.py --checkpoint_dir "/home/paperspace/Documents/RCNN-421-BioSentVec/runs-ns/1587761106/checkpoints" --unknown_dir "data/P71009-pub"

The content of data/P71009-pub can be checked in the data directory. P71009 is the Swiss-Prot entry identifier of the protein. Note: remember to change the model directories accordingly inside unit-classification.py.

Example output: P71009-pub (3 publications)
rcnn:
[1 1 0]
logistic regression:
[1 1 1]
svm:
[0 1 0]
voting:
[2 3 1]
final:
[1 1 0]
Two of the three publications are predicted as positive.
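The example output above is consistent with a simple majority vote: the three models' 0/1 predictions are summed per publication ("voting"), and a publication is marked positive when at least two of the three models agree ("final"). The sketch below reproduces those numbers under that assumption; ensemble-eval.py and unit-classification.py may implement the combination differently, and majority_vote is a hypothetical name.

```python
def majority_vote(*model_predictions):
    """Combine per-publication 0/1 predictions from several models."""
    # Sum the votes each publication received across all models.
    votes = [sum(column) for column in zip(*model_predictions)]
    n_models = len(model_predictions)
    # A publication is positive when more than half of the models say so.
    final = [1 if 2 * v > n_models else 0 for v in votes]
    return votes, final


# Predictions copied from the example output for P71009-pub.
rcnn = [1, 1, 0]
logistic_regression = [1, 1, 1]
svm = [0, 1, 0]

votes, final = majority_vote(rcnn, logistic_regression, svm)
print("voting:", votes)  # [2, 3, 1]
print("final:", final)   # [1, 1, 0]
```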
