wxy12151/AMLSII_21-22_SN21054489

Setups

Environment:

CPU: 8-core Intel i7-11800

GPU: RTX 3060

cudnn-11.2-windows-x64-v8.1.0.77

cuda_11.2.1_461.09_win10

External libraries:

tensorflow-gpu==2.5.0

keras

transformers

matplotlib

pandas

sklearn

sentencepiece
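
Before training, it is worth verifying that TensorFlow actually sees the GPU. A minimal sketch of such a check, assuming only the tensorflow-gpu==2.5.0 install listed above:

```python
# Confirm the TensorFlow version and that the RTX 3060 is visible to it.
import tensorflow as tf

print(tf.__version__)                           # expect 2.5.0
print(tf.config.list_physical_devices("GPU"))   # expect one GPU entry
```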

1. Brief introduction

The sentiment of movie reviews can be scored at the sentence or phrase level. The purpose of this project is to explore the inference power of four transformer-based models that have been pre-trained on self-supervised learning tasks.

Specifically, in this work, the split phrases are classified into five levels of emotion via transfer learning with BERT, RoBERTa, DistilBERT, and XLNet.

Click the Kaggle Competition Link for more details.

2. Organization of the files

2.1. Dataset

File "./Datasets/train.csv" :

Even though the original train/test split has been preserved for benchmarking purposes, the labels of the test set are not available. Therefore, the original training dataset of 156,060 phrases was divided into new training, validation, and testing sets in an 8:1:1 ratio, as sketched below. The restructured training set (which still contains the validation split) has 140,454 phrases.

The sentiment labels and their phrase counts within the restructured training set:

0 - negative: 6,371

1 - somewhat negative: 24,545

2 - neutral: 71,624

3 - somewhat positive: 29,626

4 - positive: 8,288
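
A minimal sketch of that 8:1:1 split, assuming pandas/sklearn and the Kaggle column name "Sentiment" (the repo's own code in "data_preprocessing.py" may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("./Datasets/train.csv")

# Hold out 10% for testing, then carve 1/9 of the remainder for validation,
# which yields an 8:1:1 train/validation/test split overall.
train_val, test = train_test_split(
    df, test_size=0.1, random_state=42, stratify=df["Sentiment"])
train, val = train_test_split(
    train_val, test_size=1/9, random_state=42, stratify=train_val["Sentiment"])
```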

2.2. Folder "Function"

This folder divides the entire project into data preprocessing, model, train, and test procedures, which are called by "main.py".

All functions in this folder are documented with detailed comments.

2.2.1. File "data_preprocessing.py"

  1. Split the data into training (with validation held within it) and testing sets

  2. Extract the useful features and labels

  3. Encode the labels as categorical and save them in a new label column

  4. Split the training set into training and validation datasets

  5. Tokenizer function (defined in "train.py"); a sketch follows this list

    Returns the training, validation, and testing datasets.
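
A hedged sketch of the tokenization step, assuming the Hugging Face fast tokenizer; the checkpoint name and the max length of 64 are illustrative assumptions, not taken from the repo:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def tokenize_phrases(phrases, max_len=64):
    # Pad/truncate every phrase to a fixed length so batches stack cleanly.
    enc = tokenizer(list(phrases), padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="tf")
    return enc["input_ids"], enc["attention_mask"]
```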

2.2.2. File "model.py"

Define the model structures of BERT, RoBERTa, DistilBERT, and XLNet.
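
For illustration, one way such a structure could look using the TensorFlow heads from Hugging Face; this is a sketch under assumed names and sizes, and the repo's exact architecture may differ:

```python
import tensorflow as tf
from transformers import TFBertModel

def build_bert_classifier(num_classes=5, max_len=64):
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")
    # Pre-trained encoder; its pooled [CLS] output feeds a 5-way softmax head.
    bert = TFBertModel.from_pretrained("bert-base-uncased")
    pooled = bert(input_ids, attention_mask=attention_mask).pooler_output
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(pooled)
    return tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
```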

2.2.3. File "train.py"

Define the training pipeline, with the optimizer, loss, and other hyper-parameters adjustable.

Return the trained model and the logged history for analyzing training convergence.
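
A minimal training sketch reusing the names from the sketches above (train_ids/train_masks and val_ids/val_masks would come from applying tokenize_phrases to the splits, with one-hot labels); the learning rate, batch size, and epoch count are illustrative assumptions, not the repo's settings:

```python
model = build_bert_classifier()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit([train_ids, train_masks], train_labels,
                    validation_data=([val_ids, val_masks], val_labels),
                    batch_size=32, epochs=3)
```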

2.2.4. File "test.py"

This part evaluates model performance on the testing dataset, producing:

  1. Classification Reports.

  2. Confusion Matrix figures.

  3. Multi-class ROC figures.

  4. Micro-averaged (global) ROC figures.
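
A sketch of how the first two outputs could be produced with sklearn, again reusing the assumed names from the earlier sketches:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

probs = model.predict([test_ids, test_masks])
y_pred = np.argmax(probs, axis=1)
y_true = np.argmax(test_labels, axis=1)   # labels are one-hot after encoding

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```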

2.3. Other Folders

Folder "images" save the training and testing figures, as well as model structures.

Folder "model_logs" saves the training logs history.

Folder "model_trained" saves the model after training.

Folder "submission" saves the .csv file using the same format as the competition.

3. Run the code

3.1. Run the code on "main.py"

Notes. It runs all steps of this project by calling the functions in the folder "Function".

3.2. Run the code on "main.ipynb"

Notes. This contains more detailed work than "main.py", including the EDA process at the beginning; open it in Jupyter to run it step by step.

All trained models employed in the report can be obtained via the GOOGLE DRIVE link.

DO NOT CHANGE THE FILE HIERARCHY.