2019-13773 Kyungjin Kim (@jamesk617)
2018-18574 Junyoung Park (@engineerA314)
2018-12018 Sungmin Song (@reverg)
We worked on a promoter classification task using machine learning methods. We implemented CNN+GRU, CNN-only, GRU-only, windowGRU, DNABERT+1-linear, and DNABERT+2-linear models. The GRU-based models showed the lowest accuracy, while the DNABERT-based models showed a meaningful improvement.
As natural language processing technology has made remarkable progress, it is now applied not only to human language but also to many other kinds of sequence data, such as DNA. DNA is the key element of the Central Dogma of biology, and it is also central to many medical tasks such as immune profiling.
One important DNA target is the promoter region. A promoter is the region that RNA polymerase and transcription factors bind to, so revealing its patterns is necessary for DNA study. As such, promoter classification is a representative example of a DNA-NLP task.
Therefore, we chose promoter classification as our project topic. We believe this kind of experience helps us understand more deeply how machine learning on sequence data works, and makes us more flexible researchers who can engage with a variety of tasks.
The baseline model from our project's reference was CNN+GRU. We implemented the CNN+GRU model, along with CNN-only, GRU-only, and DNABERT-based models for comparison. We also tried a few variations of each model for broader practice.
Our first goal was to improve accuracy on the task by finding a more appropriate model; our second was to practice various machine learning methods by implementing them ourselves.
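To make the baseline concrete, here is a minimal sketch of a CNN+GRU classifier in PyTorch. All hyperparameters (input length, channel count, kernel size, hidden size) are illustrative assumptions, not the settings used in our notebooks.

```python
import torch
import torch.nn as nn

class CNNGRUClassifier(nn.Module):
    """Toy CNN+GRU promoter classifier; hyperparameters are illustrative."""
    def __init__(self, channels=32, hidden=64):
        super().__init__()
        # 1D convolution over one-hot encoded DNA (4 input channels: A, C, G, T)
        self.conv = nn.Conv1d(4, channels, kernel_size=9, padding=4)
        self.pool = nn.MaxPool1d(3)
        # GRU reads the pooled feature map as a sequence of feature vectors
        self.gru = nn.GRU(channels, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 2)  # promoter vs. non-promoter logits

    def forward(self, x):                 # x: (batch, 4, seq_len)
        h = torch.relu(self.conv(x))      # (batch, channels, seq_len)
        h = self.pool(h)                  # (batch, channels, seq_len // 3)
        h = h.transpose(1, 2)             # (batch, steps, channels)
        _, last = self.gru(h)             # final hidden state: (1, batch, hidden)
        return self.fc(last.squeeze(0))   # (batch, 2)

model = CNNGRUClassifier()
logits = model(torch.randn(8, 4, 300))    # batch of 8 one-hot-style inputs
```

The idea of the combination: the CNN extracts local motif-like features, and the GRU summarizes their order along the sequence.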
CNN+GRU
How to run
- Upload the elements in 'CNN+GRU' directory to Google Drive.
- Run 'CNN+GRU.ipynb' with GPU mode in Colab
Result
Accuracy | Precision | Recall |
---|---|---|
0.876 | 0.864 | 0.893 |
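All result tables below report accuracy, precision, and recall. As a quick reference, these follow the standard definitions and can be computed from binary confusion-matrix counts; the counts used here are toy numbers, not from our experiments.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Accuracy, precision, and recall from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)   # of predicted promoters, how many are real
    recall = tp / (tp + fn)      # of real promoters, how many were found
    return accuracy, precision, recall

# Toy counts for illustration only
acc, prec, rec = metrics_from_counts(tp=90, fp=10, tn=85, fn=15)
# acc = 0.875, prec = 0.9, rec ≈ 0.857
```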
CNN only
How to run
- Run 'onlyCNN.ipynb'
Result
Accuracy | Precision | Recall |
---|---|---|
0.871 | 0.896 | 0.839 |
GRU only
How to run
- Upload the elements in 'onlyGRU' directory to Google Drive.
- Run 'only_GRU_experiment.ipynb' with GPU mode in Colab
Result
Accuracy | Precision | Recall |
---|---|---|
0.824 | 0.793 | 0.876 |
windowGRU
How to run
- Upload the elements in 'windowGRU' directory to Google Drive.
- Run 'window+GRU_experiment.ipynb' with GPU mode in Colab
Result
Accuracy | Precision | Recall |
---|---|---|
0.731 | 0.716 | 0.765 |
DNABERT+1-linear
How to run
- Upload the elements in 'DNABERT+1' directory to Google Drive.
- Run 'DNABERT+1_layer.ipynb' with GPU mode in Colab
Result
Accuracy | Precision | Recall |
---|---|---|
0.886 | 0.841 | 0.952 |
DNABERT+2-linear
How to run
- Upload the elements in 'DNABERT+2' directory to Google Drive.
- Run 'DNABERT_torch_2.ipynb' with GPU mode in Colab
Result
Accuracy | Precision | Recall |
---|---|---|
0.901 | 0.922 | 0.876 |
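Note that DNABERT consumes DNA as overlapping k-mer tokens rather than raw bases. A minimal sketch of that tokenization (k=3 is chosen here for illustration; the k used in our notebooks may differ):

```python
def seq_to_kmers(seq, k=3):
    """Split a DNA sequence into overlapping k-mer tokens (DNABERT-style)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

tokens = seq_to_kmers("TACGTA")
# ['TAC', 'ACG', 'CGT', 'GTA']
```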
Conclusion
- A pretrained language model can be applied to promoter classification
  - It outperformed the baseline, and there is room for even better results
- CNN by itself performs well
  - A good choice when computational resources are limited
- The RNN structure is a poor fit for this task
  - Slow training and low performance
  - Little benefit from combining it with CNN