# 2022 Fall SNU NLP Project

## Members

- 2019-13773 Kyungjin Kim (@jamesk617)
- 2018-18574 Junyoung Park (@engineerA314)
- 2018-12018 Sungmin Song (@reverg)

## Abstract

We worked on a promoter classification task using machine learning methods. We implemented CNN+GRU, CNN-only, GRU-only, window GRU, DNABERT + 1-linear, and DNABERT + 2-linear models. The GRU-based models showed the lowest accuracy, while the DNABERT-based models showed a meaningful improvement over the baseline.

## Motivation

As natural language processing technology has made remarkable progress, it is now applied not only to human language but also to many other kinds of sequence data, such as DNA. DNA is the key element of the central dogma of biology, and it is also at the core of many medical tasks such as immune profiling.

One important DNA target is the promoter. A promoter is the region that transcription factors bind to in order to initiate RNA transcription, so revealing its patterns is essential for DNA studies. As such, promoter classification is a representative DNA-NLP task.

Therefore, we chose promoter classification as our project topic. We believe this kind of experience helps us understand more deeply how machine learning on sequence data works, and makes us more flexible researchers who can engage with a variety of tasks.

## Idea

The baseline model from our project's reference was CNN+GRU. We implemented the CNN+GRU model, along with CNN-only, GRU-only, and BERT-based models for comparison. We also tried a few variations of each model for broader practice.

Our first goal was to improve accuracy on the task by finding a more appropriate model; our second goal was to practice various machine learning methods by implementing the code ourselves.

## Experiments & Results

### 1. Baseline (CNN + GRU)

#### How to run

1. Upload the contents of the 'CNN+GRU' directory to Google Drive.
2. Run 'CNN+GRU.ipynb' with GPU mode in Colab.

#### Results

1. Loss graph per epoch
   *(figure: CNN+GRU loss)*
2. Accuracy graph per epoch
   *(figure: CNN+GRU accuracy)*
3. Test set results:

| Accuracy | Precision | Recall |
|----------|-----------|--------|
| 0.876    | 0.864     | 0.893  |
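The exact layers and hyperparameters are defined in 'CNN+GRU.ipynb'; the snippet below is only a minimal PyTorch sketch of the general CNN+GRU shape, assuming one-hot encoded DNA input (four channels for A/C/G/T) and binary promoter labels. The kernel size and hidden widths are illustrative, not the notebook's values.

```python
import torch
import torch.nn as nn

class CNNGRUClassifier(nn.Module):
    """Conv1d feature extractor followed by a bidirectional GRU (sketch)."""

    def __init__(self, n_channels=4, conv_dim=64, hidden_dim=128):
        super().__init__()
        # 1-D convolution over the one-hot sequence; sizes are assumptions
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, conv_dim, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # GRU reads the convolutional feature map as a sequence
        self.gru = nn.GRU(conv_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):                      # x: (batch, 4, seq_len)
        h = self.conv(x).transpose(1, 2)       # (batch, seq_len // 2, conv_dim)
        _, hn = self.gru(h)                    # hn: (2, batch, hidden_dim)
        h = torch.cat([hn[0], hn[1]], dim=-1)  # final forward/backward states
        return self.fc(h).squeeze(-1)          # logits for BCEWithLogitsLoss
```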

### 2. CNN only

#### How to run

1. Run 'onlyCNN.ipynb'.

#### Results

1. Loss graph per epoch
   *(figure: only CNN loss)*
2. Accuracy graph per epoch
   *(figure: only CNN accuracy)*
3. Test set results:

| Accuracy | Precision | Recall |
|----------|-----------|--------|
| 0.871    | 0.896     | 0.839  |
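Conceptually, the CNN-only variant drops the recurrent part and pools the convolutional features directly. A minimal sketch under the same assumptions as above (one-hot input, illustrative sizes):

```python
import torch.nn as nn

class CNNClassifier(nn.Module):
    """Conv1d + global max pooling, no recurrent component (sketch)."""

    def __init__(self, n_channels=4, conv_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, conv_dim, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),       # global max pool over positions
        )
        self.fc = nn.Linear(conv_dim, 1)

    def forward(self, x):                  # x: (batch, 4, seq_len)
        h = self.conv(x).squeeze(-1)       # (batch, conv_dim)
        return self.fc(h).squeeze(-1)      # binary logits
```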

### 3. GRU only

#### How to run

1. Upload the contents of the 'onlyGRU' directory to Google Drive.
2. Run 'only_GRU_experiment.ipynb' with GPU mode in Colab.

#### Results

1. Loss graph per epoch
   *(figure: only GRU loss)*
2. Accuracy graph per epoch
   *(figure: only GRU accuracy)*
3. Test set results:

| Accuracy | Precision | Recall |
|----------|-----------|--------|
| 0.824    | 0.793     | 0.876  |
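The GRU-only variant feeds the one-hot sequence straight into a bidirectional GRU. A sketch with assumed dimensions:

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    """Bidirectional GRU directly over one-hot nucleotides (sketch)."""

    def __init__(self, n_channels=4, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRU(n_channels, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):                      # x: (batch, seq_len, 4)
        _, hn = self.gru(x)                    # hn: (2, batch, hidden_dim)
        h = torch.cat([hn[0], hn[1]], dim=-1)  # final forward/backward states
        return self.fc(h).squeeze(-1)          # binary logits
```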

### 4. Window GRU

#### How to run

1. Upload the contents of the 'windowGRU' directory to Google Drive.
2. Run 'window+GRU_experiment.ipynb' with GPU mode in Colab.

#### Results

1. Loss graph per epoch
   *(figure: window GRU loss)*
2. Accuracy graph per epoch
   *(figure: window GRU accuracy)*
3. Test set results:

| Accuracy | Precision | Recall |
|----------|-----------|--------|
| 0.731    | 0.716     | 0.765  |
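We read "window GRU" as a GRU that steps over fixed-length windows of bases instead of single nucleotides; 'window+GRU_experiment.ipynb' is the authoritative definition. Under that reading, a sketch (the window size of 10 is arbitrary):

```python
import torch.nn as nn

class WindowGRUClassifier(nn.Module):
    """GRU over flattened fixed-length windows of the sequence (sketch)."""

    def __init__(self, n_channels=4, window=10, hidden_dim=128):
        super().__init__()
        self.window = window
        # each window of `window` one-hot bases becomes one GRU input vector
        self.gru = nn.GRU(n_channels * window, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):                  # x: (batch, seq_len, 4)
        b, l, c = x.shape
        l = l - l % self.window            # drop any ragged tail
        x = x[:, :l].reshape(b, l // self.window, c * self.window)
        _, hn = self.gru(x)                # hn: (1, batch, hidden_dim)
        return self.fc(hn[-1]).squeeze(-1) # binary logits
```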

### 5. DNABERT + 1 linear layer

#### How to run

1. Upload the contents of the 'DNABERT+1' directory to Google Drive.
2. Run 'DNABERT+1_layer.ipynb' with GPU mode in Colab.

#### Results

1. Loss graph per epoch
   *(figure: DNABERT + 1-classifier loss)*
2. Accuracy graph per epoch
   *(figure: DNABERT + 1-classifier metric)*
3. Test set results:

| Accuracy | Precision | Recall |
|----------|-----------|--------|
| 0.886    | 0.841     | 0.952  |
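DNABERT consumes overlapping k-mer tokens rather than raw bases. The sketch below shows the idea of an encoder plus a single linear head; the checkpoint name 'zhihan1996/DNA_bert_6' and all sizes are our assumptions for illustration, and 'DNABERT+1_layer.ipynb' remains the authoritative setup.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

def to_kmers(seq, k=6):
    """DNABERT-style input: overlapping k-mers separated by spaces."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

class DNABERTClassifier(nn.Module):
    """DNABERT encoder with one linear classification head (sketch)."""

    def __init__(self, name="zhihan1996/DNA_bert_6"):  # assumed checkpoint
        super().__init__()
        self.bert = AutoModel.from_pretrained(name)
        self.head = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.last_hidden_state[:, 0])  # classify on [CLS]

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNA_bert_6")
enc = tokenizer(to_kmers("ACGTACGTACGTACGTACGT"), return_tensors="pt")
logits = DNABERTClassifier()(enc["input_ids"], enc["attention_mask"])
```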

### 6. DNABERT + 2 linear layers

#### How to run

1. Upload the contents of the 'DNABERT+2' directory to Google Drive.
2. Run 'DNABERT_torch_2.ipynb' with GPU mode in Colab.

#### Results

1. Loss graph per epoch
   *(figure: DNABERT + 2 loss)*
2. Accuracy graph per epoch
   *(figure: DNABERT + 2 accuracy)*
3. Test set results:

| Accuracy | Precision | Recall |
|----------|-----------|--------|
| 0.901    | 0.922     | 0.876  |
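Relative to the previous model, only the classification head changes: two linear layers with a nonlinearity (and, here, dropout) in between. The intermediate width and dropout rate below are assumptions, not the notebook's values.

```python
import torch.nn as nn

class TwoLayerHead(nn.Module):
    """Two-linear-layer head to swap in for the single nn.Linear (sketch)."""

    def __init__(self, hidden_size=768, mid=256, n_classes=2, p=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(p),
            nn.Linear(hidden_size, mid),
            nn.ReLU(),
            nn.Linear(mid, n_classes),
        )

    def forward(self, cls_embedding):  # (batch, hidden_size) [CLS] vector
        return self.net(cls_embedding)
```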

## Conclusion

- A pretrained language model can be applied to promoter classification.
  - It outperformed the baseline, and there is room for further improvement.
- A CNN by itself performs well.
  - A good choice when computational resources are limited.
- An RNN structure is a poor fit for this task.
  - Slow training and low performance.
  - Little benefit from combining it with a CNN.