Natural Language Processing with Disaster Tweets


This project is licensed under the GNU GPL v3.

Notebooks

Please have a look at the version history of each notebook.

Statistical models:

Deep learning models:

Code

View a training or testing script's help with this command:

python <script>.py --help

Note: use these scripts at your own risk, since I don't normally re-train models on my personal PC.

Baselines

Text preprocessing

Different text preprocessing methods are used across my implementations, but most of them follow these steps:

  • Removing emojis
  • Removing HTML tags
  • Removing URLs
  • Removing punctuation
  • Lowercasing and collapsing multiple spaces

However, there are some exceptions where the pretrained model's own preprocessing method is applied instead.
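Below is a minimal sketch of the common steps above in plain Python. The regular expressions (URL_RE, HTML_RE, EMOJI_RE), the emoji code-point ranges and the example tweet are illustrative assumptions, not the repository's actual code:

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<.*?>")
# Covers the most common emoji code-point ranges; not exhaustive.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"
    "\U00002700-\U000027BF"
    "\U0001F1E6-\U0001F1FF"
    "]+",
    flags=re.UNICODE,
)

def preprocess(text: str) -> str:
    text = EMOJI_RE.sub("", text)               # remove emojis
    text = HTML_RE.sub("", text)                # remove HTML tags
    text = URL_RE.sub("", text)                 # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()    # collapse multiple spaces
    return text.lower()                         # lowercase

print(preprocess("Forest fire near La Ronge, Sask. <b>Canada</b> http://t.co/x 🔥🔥"))
# -> forest fire near la ronge sask canada
```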

Training configurations

Statistical models

  • Training data: the full training set is used.
  • Hyperparameters: sklearn.model_selection.GridSearchCV is used to automatically pick the best combination (a sketch follows this list).
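As a rough illustration, a TF-IDF + Logistic Regression pipeline tuned with GridSearchCV could look like the sketch below; the parameter grid, scoring choice and variable names are assumptions for demonstration, not the repository's actual search space:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# TF-IDF vectorizer feeding a Logistic Regression classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative search space; the actual grids are defined per model in the notebooks.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5, n_jobs=-1)
# search.fit(train_texts, train_labels)   # fit on the full training set, per the note above
# print(search.best_params_, search.best_score_)
```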

Deep learning models

RNNs
Hyperparameter Value
Train:test 8:2
Batch size (train/test) 64/32
Learning rate 1e-4
Embedding dim 64
Epochs 10
Vocab size 10000
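
For orientation, a 1-layer bidirectional LSTM classifier using these hyperparameters could look roughly like the Keras sketch below; the framework and the number of recurrent units are assumptions, since neither is fixed by the table above:

```python
import tensorflow as tf

VOCAB_SIZE = 10_000
EMBEDDING_DIM = 64

# 1-layer bidirectional LSTM text classifier (illustrative sketch, not the repo's exact code).
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),   # 64 units is an assumption
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
# model.fit(train_ds, validation_data=test_ds, epochs=10)
# The 64/32 batch sizes are set when building the train/test datasets.
```
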
CNNs
Hyperparameter Value
Train:test 8:2
Batch size (Train/test) 64/32
Filter size 100
Window size [3, 4, 5]
L2 regularization 3
Dropout rate 0.5
Dense unit 64
Learning rate 1e-4
Epochs 100
Vocab size 10000
Early stopping 20 epochs
Classification threshold 0.5
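
A Kim-style multi-window CNN with these settings might be wired up as below. This is a hedged Keras sketch: the sequence length, the embedding size, and the reading of "L2 regularization 3" as an L2 max-norm constraint of 3 are my assumptions:

```python
import tensorflow as tf

VOCAB_SIZE = 10_000
SEQ_LEN = 64          # assumed maximum tweet length (not stated above)
EMBEDDING_DIM = 100   # assumed; e.g. glove-twitter-100

inputs = tf.keras.Input(shape=(SEQ_LEN,), dtype="int32")
embedded = tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)

# One Conv1D branch per window size in [3, 4, 5], each with 100 filters;
# "L2 regularization 3" is interpreted here as a max-norm constraint of 3 on the kernels.
branches = []
for window in (3, 4, 5):
    conv = tf.keras.layers.Conv1D(
        filters=100,
        kernel_size=window,
        activation="relu",
        kernel_constraint=tf.keras.constraints.MaxNorm(3),
    )(embedded)
    branches.append(tf.keras.layers.GlobalMaxPooling1D()(conv))

x = tf.keras.layers.Concatenate()(branches)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # threshold 0.5 at prediction time

model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

early_stop = tf.keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)
# model.fit(train_ds, validation_data=test_ds, epochs=100, callbacks=[early_stop])
```
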
CNN & RNN

CNN & RNN feed model:

Hyperparameter Value
Train:test 8:2
Batch size (Train/test) 64/32
Recurrent units 512
Filter size 200
Window size [1, 2, 3]
Dropout rate 0.5
Dense unit 64
Learning rate 1e-4
Epochs 100
Vocab size 10000
Early stopping 20 epochs
Classification threshold 0.5

CNN & BiRNN feed model:

Hyperparameter Value
Train:test 8:2
Batch size (Train/test) 64/32
Recurrent units 512
Filter size 200
Window size [1, 2, 3]
Dropout rate 0.5
Dense unit 64
Learning rate 1e-4
Epochs 100
Vocab size 10000
Early stopping 10 epochs
Classification threshold 0.5

CNN & RNN concat model:

Hyperparameter Value
Train:test 8:2
Batch size (Train/test) 64/32
Recurrent units 512
Filter size 200
Window size [1, 2, 3]
Dropout rate 0.5
Dense unit 64
Learning rate 1e-4
Epochs 100
Vocab size 10000
Early stopping 5 epochs
Classification threshold 0.5

CNN & BiRNN concat model:

Hyperparameter Value
Train:test 8:2
Batch size (Train/test) 64/32
Recurrent units 512
Filter size 200
Window size [1, 2, 3]
Dropout rate 0.5
Dense unit 64
Learning rate 1e-4
Epochs 100
Vocab size 10000
Early stopping 10 epochs
Classification threshold 0.5
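
The tables above do not spell out how the CNN and RNN branches are combined in the "feed" and "concat" variants. As one plausible reading of the "concat" variant, both branches can run over the same embedding and be concatenated before the dense head; the Keras sketch below is an assumption, not the repository's exact architecture:

```python
import tensorflow as tf

VOCAB_SIZE = 10_000
SEQ_LEN = 64         # assumed maximum sequence length
EMBEDDING_DIM = 50   # assumed; e.g. glove-twitter-50

inputs = tf.keras.Input(shape=(SEQ_LEN,), dtype="int32")
embedded = tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)

# CNN branch: 200 filters per window size in [1, 2, 3], max-pooled over time.
conv_outputs = []
for window in (1, 2, 3):
    conv = tf.keras.layers.Conv1D(200, window, activation="relu")(embedded)
    conv_outputs.append(tf.keras.layers.GlobalMaxPooling1D()(conv))

# RNN branch: 512 recurrent units (a BiLSTM here, as in the "CNN & BiRNN" variants).
rnn = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(512))(embedded)

# "concat" variant: concatenate both branches before the dense head.
x = tf.keras.layers.Concatenate()(conv_outputs + [rnn])
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```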

LLMs

Hyperparameter Value
Train:dev:test ratio 6:2:2
Batch size 64
Learning rate 2e-5
Weight decay 0.01
Epochs 50
Early stopping 5 epochs
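
A fine-tuning run with these hyperparameters can be sketched with the Hugging Face Trainer as below; the checkpoint name, output directory and the metric used for early stopping are placeholders and assumptions, not the repository's exact setup:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "bert-base-uncased"   # placeholder; any checkpoint from the tables below

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=50,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",  # assumption; could equally be F1
)

trainer = Trainer(
    model=model,
    args=args,
    # train_dataset=train_ds,           # tokenized splits from the 6:2:2 ratio go here
    # eval_dataset=dev_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
# trainer.train()
```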

Too-large LLMs


Some large LLMs cannot be trained with the hyperparameters in the LLMs section. To fit those models into the Kaggle GPU's memory, I reduced the batch size and learning rate to the following values:

Hyperparameter Value
Train:dev:test ratio 6:2:2
Batch size 32
Learning rate 1e-5
Weight decay 0.01
Epochs 50
Early stopping 5 epochs

All remaining hyperparameters stay the same as in the LLMs section.

Results

Experiment setup: All experiments were conducted under the same Kaggle environment:

Configuration Value
CPU Intel Xeon 2.20 GHz, 4 vCPU cores
Memory 32 GB
GPU NVIDIA Tesla T4 (x2) (LLMs) or P100 (RNNs, CNNs)
Random seed 42

wandb.ai report

Statistical models

Model Vectorizer Training configurations Public F1
K-Means TFIDF [1] 0.50658
Linear Models Logistic Regression TFIDF [1] 0.80171
Stochastic Gradient Descent TFIDF [1] 0.80386
Support Vector Machine TFIDF [1] 0.80140
Random Forest TFIDF [1] 0.78792
AdaBoost Decision Tree TFIDF [1] 0.72847
Bagging Decision Tree TFIDF [1] 0.74348
Decision Tree TFIDF [1] 0.71069
Gradient Boosting TFIDF [1] 0.73889
XGBoost TFIDF [1] 0.74992
Naive Bayes Multinomial Naive Bayes TFIDF [1] 0.80447
Complement Naive Bayes TFIDF [1] 0.79589
Multilayer Perceptrons TFIDF [1] 0.75911

Deep learning models

RNNs, CNNs and ensemble models
Model (with paper link) Pretrain parameters Training configurations Public F1 Notes
RNN 1-layer Bidirectional LSTM 714,369 [3] 0.77352
2-layers stacked Bidirectional LSTM 751,489 [3] 0.78026
1-layer Bidirectional GRU 698,241 [3] 0.77536
2-layers stacked Bidirectional GRU 725,249 [3] 0.77566
RNN + Attention 1-layer Bidirectional LSTM + Dot Attention 714,369 [3] 0.76892
1-layer Bidirectional GRU + Dot Attention 698,241 [3] 0.78516
1-layer Bidirectional LSTM + General Attention 730,881 [3] 0.77995
1-layer Bidirectional GRU + General Attention 714,753 [3] 0.77719
1-layer Bidirectional LSTM + Concatenate Attention 730,946 [3] 0.78148
1-layer Bidirectional GRU + Concatenate Attention 714,818 [3] 0.77873
Deep CNN (random + pretrained embedding) CNN non-static (random embedding) 299,629 [3] 0.71345 Embedding dimension = 25 (equal to the GloVe vector size)
CNN static (glove-twitter-25) 299,629 [3] 0.77689
CNN static (glove-twitter-50) 579,629 [3] 0.78700
CNN static (glove-twitter-100) 1,139,629 [3] 0.79374
CNN static (glove-twitter-200) 2,259,629 [3] 0.79711
CNN static (fasttext-wiki-news-subwords-300) 3,379,629 [3] 0.57033
CNN non-static (glove-twitter-25) 299,629 [3] 0.80478
CNN non-static (glove-twitter-50) 579,629 [3] 0.79619
CNN non-static (glove-twitter-100) 1,139,629 [3] 0.79987
CNN non-static (glove-twitter-200) 2,259,629 [3] 0.80140
CNN non-static (fasttext-wiki-news-subwords-300) 3,379,629 [3] 0.73980
Multi-channel CNN and RNN Random embedding (static) + Unidirectional LSTM 3,326,169 [3] 0.67391
Random embedding (static) + Bidirectional LSTM 4,411,609 [3] 0.68709
Random embedding (static) + Unidirectional GRU (todo) [3] (todo)
Random embedding (static) + Bidirectional GRU (todo) [3] (todo)
GloVe (glove-twitter-25, static) + Unidirectional LSTM 1,366,169 [3] 0.68372
GloVe (glove-twitter-25, static) + Bidirectional LSTM 2,451,609 [3] 0.78976
GloVe (glove-twitter-50, static) + Unidirectional LSTM 1,646,169 [3] 0.77781
GloVe (glove-twitter-50, static) + Bidirectional LSTM 2,731,609 [3] 0.78148
GloVe (glove-twitter-100, static) + Unidirectional LSTM 2,206,169 [3] 0.73460
GloVe (glove-twitter-100, static) + Bidirectional LSTM 3,291,609 [3] 0.78700
GloVe (glove-twitter-200, static) + Unidirectional LSTM 3,326,169 [3] 0.71835
GloVe (glove-twitter-200, static) + Bidirectional LSTM 4,411,609 [3] 0.76310
Random embedding (nonstatic) + Unidirectional LSTM 3,326,169 [3] 0.71284
Random embedding (nonstatic) + Bidirectional LSTM 4,411,609 [3] 0.75390
Random embedding (nonstatic) + Unidirectional GRU (todo) [3] (todo)
Random embedding (nonstatic) + Bidirectional GRU (todo) [3] (todo)
GloVe (glove-twitter-25, nonstatic) + Unidirectional LSTM 1,366,169 [3] 0.75942
GloVe (glove-twitter-25, nonstatic) + Bidirectional LSTM 2,451,609 [3] 0.79436
GloVe (glove-twitter-50, nonstatic) + Unidirectional LSTM 1,646,169 [3] 0.78240
GloVe (glove-twitter-50, nonstatic) + Bidirectional LSTM 2,731,609 [3] 0.79957
GloVe (glove-twitter-100, nonstatic) + Unidirectional LSTM 2,206,169 [3] 0.78700
GloVe (glove-twitter-100, nonstatic) + Bidirectional LSTM 3,291,609 [3] 0.76064
GloVe (glove-twitter-200, nonstatic) + Unidirectional LSTM 3,326,169 [3] 0.78179
GloVe (glove-twitter-200, nonstatic) + Bidirectional LSTM 4,411,609 [3] 0.77474
Multi-channel CNN and RNN (concat) Random embedding (static) + Unidirectional LSTM 3,772,121 [3] 0.78394 Embedding dimension = 200
Random embedding (static) + Bidirectional LSTM 5,265,113 [3] 0.78700
Random embedding (static) + Unidirectional GRU 3,408,601 [3] 0.78302
Random embedding (static) + Bidirectional GRU 4,538,073 [3] 0.77627
GloVe (glove-twitter-25, static) + Unidirectional LSTM 1,453,721 [3] 0.80110
GloVe (glove-twitter-25, static) + Bidirectional LSTM 2,588,313 [3] 0.79436
GloVe (glove-twitter-25, static) + Unidirectional GRU 1,179,801 [3] 0.80294
GloVe (glove-twitter-25, static) + Bidirectional GRU 2,040,473 [3] 0.79528
GloVe (glove-twitter-50, static) + Unidirectional LSTM 1,784,921 [3] 0.81091
GloVe (glove-twitter-50, static) + Bidirectional LSTM 2,970,713 [3] 0.81366
GloVe (glove-twitter-50, static) + Unidirectional GRU 1,498,201 [3] 0.80907
GloVe (glove-twitter-50, static) + Bidirectional GRU 2,397,273 [3] 0.80937
GloVe (glove-twitter-100, static) + Unidirectional LSTM 2,447,321 [3] 0.80539
GloVe (glove-twitter-100, static) + Bidirectional LSTM 3,735,513 [3] 0.81305
GloVe (glove-twitter-100, static) + Unidirectional GRU (todo) [3] (todo)
GloVe (glove-twitter-100, static) + Bidirectional GRU 3,110,873 [3] 0.80907
GloVe (glove-twitter-200, static) + Unidirectional LSTM 3,772,121 [3] 0.80723
GloVe (glove-twitter-200, static) + Bidirectional LSTM 5,265,113 [3] 0.81152
GloVe (glove-twitter-200, static) + Unidirectional GRU 3,408,601 [3] 3,408,601
GloVe (glove-twitter-200, static) + Bidirectional GRU 4,538,073 [3] 0.80815
Random embedding (nonstatic) + Unidirectional LSTM 3,772,121 [3] 0.74164
Random embedding (nonstatic) + Bidirectional LSTM 5,265,113 [3] 0.77444
Random embedding (nonstatic) + Unidirectional GRU 3,408,601 [3] 0.80171
Random embedding (nonstatic) + Bidirectional GRU 4,538,073 [3] 0.80049
GloVe (glove-twitter-25, nonstatic) + Unidirectional LSTM 1,453,721 [3] 0.80876
GloVe (glove-twitter-25, nonstatic) + Bidirectional LSTM 2,588,313 [3] 0.79834
GloVe (glove-twitter-25, nonstatic) + Unidirectional GRU 1,179,801 [3] 0.80815
GloVe (glove-twitter-25, nonstatic) + Bidirectional GRU 2,040,473 [3] 0.79650
GloVe (glove-twitter-50, nonstatic) + Unidirectional LSTM 1,784,921 [3] 0.80539
GloVe (glove-twitter-50, nonstatic) + Bidirectional LSTM 2,970,713 [3] 0.81213
GloVe (glove-twitter-50, nonstatic) + Unidirectional GRU 1,498,201 [3] 0.80968
GloVe (glove-twitter-50, nonstatic) + Bidirectional GRU 2,397,273 [3] 0.80386
GloVe (glove-twitter-100, nonstatic) + Unidirectional LSTM 2,447,321 [3] 0.81029
GloVe (glove-twitter-100, nonstatic) + Bidirectional LSTM 3,735,513 [3] 0.80968
GloVe (glove-twitter-100, nonstatic) + Unidirectional GRU 2,135,001 [3] 0.80570
GloVe (glove-twitter-100, nonstatic) + Bidirectional GRU 3,110,873 [3] 0.80815
GloVe (glove-twitter-200, nonstatic) + Unidirectional LSTM 3,772,121 [3] 0.80508
GloVe (glove-twitter-200, nonstatic) + Bidirectional LSTM 5,265,113 [3] 0.81182
GloVe (glove-twitter-200, nonstatic) + Unidirectional GRU 3,408,601 [3] 0.81244
GloVe (glove-twitter-200, nonstatic) + Bidirectional GRU 4,538,073 [3] 0.80999
LLMs
Model (with paper link) Pretrain parameters Training configurations Public F1 Notes
ALBERT base-v1 11M (huggingface) [2] 0.80907 View list of parameters by huggingface here
large-v1 17M (huggingface) [2] 0.80416
xlarge-v1 58M (huggingface) [4] 0.81182
xxlarge-v1 223M (huggingface) [4] 0.78853
base-v2 11M (huggingface) [2] 0.79528
large-v2 17M (huggingface) [2] 0.81520
xlarge-v2 58M (huggingface) [4] 0.81703
xxlarge-v2 223M (huggingface) [4] 0.80570
BART base 140M (facebook-research) [2] 0.82684 View list of parameters by facebook-research here
large 400M (facebook-research) [2] 0.83726
large-mnli 400M (facebook-research) [2] 0.83450
large-cnn 400M (facebook-research) [2] 0.82347
BERT base uncased 110M (huggingface) [2] 0.82899 View list of parameters by huggingface here
base cased 110M (huggingface) [2] 0.81060
large uncased 340M (huggingface) [2] 0.83052
large cased 340M (huggingface) [2] 0.82194
large uncased whole word masking 335M (huggingface) [2] 0.82255
large cased whole word masking 336M (huggingface) [2] 0.81244
multilingual uncased 168M (huggingface) [2] 0.81887
multilingual cased 179M (huggingface) [2] 0.81918
BERTweet base 135M (vinai) [2] 0.83726 View list of parameters by vinai here
covid19-base-uncased 135M (vinai) [2] 0.84002
covid19-base-cased 135M (vinai) [2] 0.82960
large 335M (vinai) [2] 0.82899
BORT base 56.1M (amazon) [2] 0.74563 Parameters from the original paper
DeBERTa base 100M (microsoft) [2] 0.81642 View list of parameters by microsoft here
base-mnli 86M (microsoft) [2] 0.80661
large 350M (microsoft) [4] 0.84308
large-mnli 350M (microsoft) [4] 0.83757
DeBERTa v3 xsmall 22M (microsoft) [2] 0.80815 View list of parameters by microsoft here
small 44M (microsoft) [2] 0.82408
base 86M (microsoft) [2] 0.83205
large 304M (microsoft) [4] 0.82745
mdeberta-v3-base 86M (microsoft) [2] 0.82929
DistilBERT base uncased 66M (huggingface) [2] 0.82439 View list of parameters by huggingface here
base cased 65M (huggingface) [2] 0.82163
multilingual cased 134M (huggingface) [2] 0.80049
ELECTRA (discriminator) small 14M (google) [2] 0.81887 View list of parameters by google here
base 110M (google) [2] 0.82776
large 335M (google) [2] 0.83726
RoBERTa base 125M (huggingface) [2] 0.82868 View list of parameters by huggingface here
large 355M (huggingface) [2] 0.84033
distilroberta-large 82M (huggingface) [2] 0.82960
SqueezeBERT uncased 51M (huggingface) [2] 0.80324 View list of parameters by huggingface here
mnli 51M (huggingface) [2] 0.79987
mnli-headless 51M (huggingface) [2] 0.80416
Twitter RoBERTa Sentiment base N/A [2] 0.83389 CardiffNLP has a huge list of Twitter pretrained models and these are just 3 of them. Try finetuning others (if you have time).
base latest N/A [2] 0.82776
base 2021 124M (cardiffnlp) [2] 0.83083
XLM-RoBERTa base 270M (huggingface) [2] 0.82439 View list of parameters by huggingface here
large 550M (huggingface) [2] 0.82500
XLNet base cased 110M (huggingface) [2] 0.82592 View list of parameters by huggingface here
large cased 340M (huggingface) [4] 0.81612