# 2022 Fall SNU NLP Project

## Members

- 2019-13773 Kyungjin Kim (@jamesk617)
- 2018-18574 Junyoung Park (@engineerA314)
- 2018-12018 Sungmin Song (@reverg)

## Abstract

We worked on a promoter classification task using machine learning methods. We implemented CNN+GRU, CNN-only, GRU-only, window GRU, DNABERT + 1-linear, and DNABERT + 2-linear models. The GRU-based models showed the lowest accuracy, while the DNABERT-based models showed a meaningful improvement over the baseline.

## Motivation

As natural language processing technology has made remarkable progress, it is now applied not only to human language but also to many other kinds of sequence data, such as DNA. DNA is the key element of the central dogma of biology, and it is also at the core of many medical tasks such as immune profiling.

One important DNA target is the promoter. A promoter is the region that transcription factors bind to in order to initiate RNA transcription, so revealing its patterns is essential for DNA studies. As such, promoter classification is a representative DNA-NLP task.

Therefore, we chose promoter classification as our project topic. We believe this kind of experience helps us understand more deeply how machine learning on sequence data works, and makes us more flexible researchers who can engage with a variety of tasks.

## Idea

The baseline model from our project's reference was CNN+GRU. We implemented the CNN+GRU model, along with CNN-only, GRU-only, and BERT-based models for comparison. We also tried a few variations of each model for broader practice.

Our first goal was to improve accuracy on the task by finding a more appropriate model; our second goal was to practice various machine learning methods by implementing the code ourselves.

## Experiments & Results

### 1. Baseline (CNN + GRU)

#### How to run

1. Upload the contents of the 'CNN+GRU' directory to Google Drive.
2. Run 'CNN+GRU.ipynb' with GPU mode in Colab.

#### Results

1. Loss graph per epoch
   *(figure: CNN+GRU loss)*
2. Accuracy graph per epoch
   *(figure: CNN+GRU accuracy)*
3. Test set results:

| Accuracy | Precision | Recall |
|----------|-----------|--------|
| 0.876    | 0.864     | 0.893  |
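The exact layers and hyperparameters are defined in 'CNN+GRU.ipynb'; the snippet below is only a minimal PyTorch sketch of the general CNN+GRU shape, assuming one-hot encoded DNA input (four channels for A/C/G/T) and binary promoter labels. The kernel size and hidden widths are illustrative, not the notebook's values.

```python
import torch
import torch.nn as nn

class CNNGRUClassifier(nn.Module):
    """Conv1d feature extractor followed by a bidirectional GRU (sketch)."""

    def __init__(self, n_channels=4, conv_dim=64, hidden_dim=128):
        super().__init__()
        # 1-D convolution over the one-hot sequence; sizes are assumptions
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, conv_dim, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # GRU reads the convolutional feature map as a sequence
        self.gru = nn.GRU(conv_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):                      # x: (batch, 4, seq_len)
        h = self.conv(x).transpose(1, 2)       # (batch, seq_len // 2, conv_dim)
        _, hn = self.gru(h)                    # hn: (2, batch, hidden_dim)
        h = torch.cat([hn[0], hn[1]], dim=-1)  # final forward/backward states
        return self.fc(h).squeeze(-1)          # logits for BCEWithLogitsLoss
```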

### 2. CNN only

#### How to run

1. Run 'onlyCNN.ipynb'.

#### Results

1. Loss graph per epoch
   *(figure: only CNN loss)*
2. Accuracy graph per epoch
   *(figure: only CNN accuracy)*
3. Test set results:

| Accuracy | Precision | Recall |
|----------|-----------|--------|
| 0.871    | 0.896     | 0.839  |
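Conceptually, the CNN-only variant drops the recurrent part and pools the convolutional features directly. A minimal sketch under the same assumptions as above (one-hot input, illustrative sizes):

```python
import torch.nn as nn

class CNNClassifier(nn.Module):
    """Conv1d + global max pooling, no recurrent component (sketch)."""

    def __init__(self, n_channels=4, conv_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, conv_dim, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),       # global max pool over positions
        )
        self.fc = nn.Linear(conv_dim, 1)

    def forward(self, x):                  # x: (batch, 4, seq_len)
        h = self.conv(x).squeeze(-1)       # (batch, conv_dim)
        return self.fc(h).squeeze(-1)      # binary logits
```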

### 3. GRU only

#### How to run

1. Upload the contents of the 'onlyGRU' directory to Google Drive.
2. Run 'only_GRU_experiment.ipynb' with GPU mode in Colab.

#### Results

1. Loss graph per epoch
   *(figure: only GRU loss)*
2. Accuracy graph per epoch
   *(figure: only GRU accuracy)*
3. Test set results:

| Accuracy | Precision | Recall |
|----------|-----------|--------|
| 0.824    | 0.793     | 0.876  |
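The GRU-only variant feeds the one-hot sequence straight into a bidirectional GRU. A sketch with assumed dimensions:

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    """Bidirectional GRU directly over one-hot nucleotides (sketch)."""

    def __init__(self, n_channels=4, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRU(n_channels, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):                      # x: (batch, seq_len, 4)
        _, hn = self.gru(x)                    # hn: (2, batch, hidden_dim)
        h = torch.cat([hn[0], hn[1]], dim=-1)  # final forward/backward states
        return self.fc(h).squeeze(-1)          # binary logits
```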

### 4. Window GRU

#### How to run

1. Upload the contents of the 'windowGRU' directory to Google Drive.
2. Run 'window+GRU_experiment.ipynb' with GPU mode in Colab.

#### Results

1. Loss graph per epoch
   *(figure: window GRU loss)*
2. Accuracy graph per epoch
   *(figure: window GRU accuracy)*
3. Test set results:

| Accuracy | Precision | Recall |
|----------|-----------|--------|
| 0.731    | 0.716     | 0.765  |
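We read "window GRU" as a GRU that steps over fixed-length windows of bases instead of single nucleotides; 'window+GRU_experiment.ipynb' is the authoritative definition. Under that reading, a sketch (the window size of 10 is arbitrary):

```python
import torch.nn as nn

class WindowGRUClassifier(nn.Module):
    """GRU over flattened fixed-length windows of the sequence (sketch)."""

    def __init__(self, n_channels=4, window=10, hidden_dim=128):
        super().__init__()
        self.window = window
        # each window of `window` one-hot bases becomes one GRU input vector
        self.gru = nn.GRU(n_channels * window, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):                  # x: (batch, seq_len, 4)
        b, l, c = x.shape
        l = l - l % self.window            # drop any ragged tail
        x = x[:, :l].reshape(b, l // self.window, c * self.window)
        _, hn = self.gru(x)                # hn: (1, batch, hidden_dim)
        return self.fc(hn[-1]).squeeze(-1) # binary logits
```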

### 5. DNABERT + 1 linear layer

#### How to run

1. Upload the contents of the 'DNABERT+1' directory to Google Drive.
2. Run 'DNABERT+1_layer.ipynb' with GPU mode in Colab.

#### Results

1. Loss graph per epoch
   *(figure: DNABERT + 1-classifier loss)*
2. Accuracy graph per epoch
   *(figure: DNABERT + 1-classifier metric)*
3. Test set results:

| Accuracy | Precision | Recall |
|----------|-----------|--------|
| 0.886    | 0.841     | 0.952  |
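DNABERT consumes overlapping k-mer tokens rather than raw bases. The sketch below shows the idea of an encoder plus a single linear head; the checkpoint name 'zhihan1996/DNA_bert_6' and all sizes are our assumptions for illustration, and 'DNABERT+1_layer.ipynb' remains the authoritative setup.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

def to_kmers(seq, k=6):
    """DNABERT-style input: overlapping k-mers separated by spaces."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

class DNABERTClassifier(nn.Module):
    """DNABERT encoder with one linear classification head (sketch)."""

    def __init__(self, name="zhihan1996/DNA_bert_6"):  # assumed checkpoint
        super().__init__()
        self.bert = AutoModel.from_pretrained(name)
        self.head = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.last_hidden_state[:, 0])  # classify on [CLS]

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNA_bert_6")
enc = tokenizer(to_kmers("ACGTACGTACGTACGTACGT"), return_tensors="pt")
logits = DNABERTClassifier()(enc["input_ids"], enc["attention_mask"])
```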

### 6. DNABERT + 2 linear layers

#### How to run

1. Upload the contents of the 'DNABERT+2' directory to Google Drive.
2. Run 'DNABERT_torch_2.ipynb' with GPU mode in Colab.

#### Results

1. Loss graph per epoch
   *(figure: DNABERT + 2 loss)*
2. Accuracy graph per epoch
   *(figure: DNABERT + 2 accuracy)*
3. Test set results:

| Accuracy | Precision | Recall |
|----------|-----------|--------|
| 0.901    | 0.922     | 0.876  |
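Relative to the previous model, only the classification head changes: two linear layers with a nonlinearity (and, here, dropout) in between. The intermediate width and dropout rate below are assumptions, not the notebook's values.

```python
import torch.nn as nn

class TwoLayerHead(nn.Module):
    """Two-linear-layer head to swap in for the single nn.Linear (sketch)."""

    def __init__(self, hidden_size=768, mid=256, n_classes=2, p=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(p),
            nn.Linear(hidden_size, mid),
            nn.ReLU(),
            nn.Linear(mid, n_classes),
        )

    def forward(self, cls_embedding):  # (batch, hidden_size) [CLS] vector
        return self.net(cls_embedding)
```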

## Conclusion

- A pretrained language model can be applied to promoter classification.
  - It outperformed the baseline, and there is room for further improvement.
- A CNN by itself performs well.
  - A good choice when computational resources are limited.
- An RNN structure is a poor fit for this task.
  - Slow training and low performance.
  - Little benefit from combining it with a CNN.