Natural Language Processing with Disaster Tweets


This project is licensed under the GNU GPL v3.

Notebooks

Please have a look at the version history of each notebook.

Statistical models:

Deep learning models:

Code

View a training or testing script's help with this command:

python <script>.py --help

Note: use these scripts at your own risk, since I don't normally re-train models on my personal PC.

Baselines

Text preprocessing

Different text preprocessing methods are used across my implementations, but most of them follow these steps:

  • Removing emojis
  • Removing HTML tags
  • Removing URLs
  • Removing punctuation
  • Lowercasing and collapsing multiple spaces

However, there are some exceptions where the pretrained model's own preprocessing method is applied instead.
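Below is a minimal sketch of the common steps above in plain Python. The regular expressions (URL_RE, HTML_RE, EMOJI_RE), the emoji code-point ranges and the example tweet are illustrative assumptions, not the repository's actual code:

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<.*?>")
# Covers the most common emoji code-point ranges; not exhaustive.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"
    "\U00002700-\U000027BF"
    "\U0001F1E6-\U0001F1FF"
    "]+",
    flags=re.UNICODE,
)

def preprocess(text: str) -> str:
    text = EMOJI_RE.sub("", text)               # remove emojis
    text = HTML_RE.sub("", text)                # remove HTML tags
    text = URL_RE.sub("", text)                 # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()    # collapse multiple spaces
    return text.lower()                         # lowercase

print(preprocess("Forest fire near La Ronge, Sask. <b>Canada</b> http://t.co/x 🔥🔥"))
# -> forest fire near la ronge sask canada
```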

Training configurations

Statistical models

  • Training data: the full training set is used.
  • Hyperparameters: sklearn.model_selection.GridSearchCV is used to automatically pick the best combination (a sketch follows this list).
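As a rough illustration, a TF-IDF + Logistic Regression pipeline tuned with GridSearchCV could look like the sketch below; the parameter grid, scoring choice and variable names are assumptions for demonstration, not the repository's actual search space:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# TF-IDF vectorizer feeding a Logistic Regression classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative search space; the actual grids are defined per model in the notebooks.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5, n_jobs=-1)
# search.fit(train_texts, train_labels)   # fit on the full training set, per the note above
# print(search.best_params_, search.best_score_)
```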

Deep learning models

RNNs
Hyperparameter Value
Train:test 8:2
Batch size (train/test) 64/32
Learning rate 1e-4
Embedding dim 64
Epochs 10
Vocab size 10000
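
For orientation, a 1-layer bidirectional LSTM classifier using these hyperparameters could look roughly like the Keras sketch below; the framework and the number of recurrent units are assumptions, since neither is fixed by the table above:

```python
import tensorflow as tf

VOCAB_SIZE = 10_000
EMBEDDING_DIM = 64

# 1-layer bidirectional LSTM text classifier (illustrative sketch, not the repo's exact code).
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),   # 64 units is an assumption
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
# model.fit(train_ds, validation_data=test_ds, epochs=10)
# The 64/32 batch sizes are set when building the train/test datasets.
```
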
CNNs
Hyperparameter Value
Train:test 8:2
Batch size (Train/test) 64/32
Filter size 100
Window size [3, 4, 5]
L2 regularization 3
Dropout rate 0.5
Dense unit 64
Learning rate 1e-4
Epochs 100
Vocab size 10000
Early stopping 20 epochs
Classification threshold 0.5
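
A Kim-style multi-window CNN with these settings might be wired up as below. This is a hedged Keras sketch: the sequence length, the embedding size, and the reading of "L2 regularization 3" as an L2 max-norm constraint of 3 are my assumptions:

```python
import tensorflow as tf

VOCAB_SIZE = 10_000
SEQ_LEN = 64          # assumed maximum tweet length (not stated above)
EMBEDDING_DIM = 100   # assumed; e.g. glove-twitter-100

inputs = tf.keras.Input(shape=(SEQ_LEN,), dtype="int32")
embedded = tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)

# One Conv1D branch per window size in [3, 4, 5], each with 100 filters;
# "L2 regularization 3" is interpreted here as a max-norm constraint of 3 on the kernels.
branches = []
for window in (3, 4, 5):
    conv = tf.keras.layers.Conv1D(
        filters=100,
        kernel_size=window,
        activation="relu",
        kernel_constraint=tf.keras.constraints.MaxNorm(3),
    )(embedded)
    branches.append(tf.keras.layers.GlobalMaxPooling1D()(conv))

x = tf.keras.layers.Concatenate()(branches)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # threshold 0.5 at prediction time

model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

early_stop = tf.keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)
# model.fit(train_ds, validation_data=test_ds, epochs=100, callbacks=[early_stop])
```
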
CNN & RNN

CNN & RNN feed model:

Hyperparameter Value
Train:test 8:2
Batch size (Train/test) 64/32
Recurrent units 512
Filter size 200
Window size [1, 2, 3]
Dropout rate 0.5
Dense unit 64
Learning rate 1e-4
Epochs 100
Vocab size 10000
Early stopping 20 epochs
Classification threshold 0.5

CNN & BiRNN feed model:

Hyperparameter Value
Train:test 8:2
Batch size (Train/test) 64/32
Recurrent units 512
Filter size 200
Window size [1, 2, 3]
Dropout rate 0.5
Dense unit 64
Learning rate 1e-4
Epochs 100
Vocab size 10000
Early stopping 10 epochs
Classification threshold 0.5

CNN & RNN concat model:

Hyperparameter Value
Train:test 8:2
Batch size (Train/test) 64/32
Recurrent units 512
Filter size 200
Window size [1, 2, 3]
Dropout rate 0.5
Dense unit 64
Learning rate 1e-4
Epochs 100
Vocab size 10000
Early stopping 5 epochs
Classification threshold 0.5

CNN & BiRNN concat model:

Hyperparameter Value
Train:test 8:2
Batch size (Train/test) 64/32
Recurrent units 512
Filter size 200
Window size [1, 2, 3]
Dropout rate 0.5
Dense unit 64
Learning rate 1e-4
Epochs 100
Vocab size 10000
Early stopping 10 epochs
Classification threshold 0.5
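
The tables above do not spell out how the CNN and RNN branches are combined in the "feed" and "concat" variants. As one plausible reading of the "concat" variant, both branches can run over the same embedding and be concatenated before the dense head; the Keras sketch below is an assumption, not the repository's exact architecture:

```python
import tensorflow as tf

VOCAB_SIZE = 10_000
SEQ_LEN = 64         # assumed maximum sequence length
EMBEDDING_DIM = 50   # assumed; e.g. glove-twitter-50

inputs = tf.keras.Input(shape=(SEQ_LEN,), dtype="int32")
embedded = tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)

# CNN branch: 200 filters per window size in [1, 2, 3], max-pooled over time.
conv_outputs = []
for window in (1, 2, 3):
    conv = tf.keras.layers.Conv1D(200, window, activation="relu")(embedded)
    conv_outputs.append(tf.keras.layers.GlobalMaxPooling1D()(conv))

# RNN branch: 512 recurrent units (a BiLSTM here, as in the "CNN & BiRNN" variants).
rnn = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(512))(embedded)

# "concat" variant: concatenate both branches before the dense head.
x = tf.keras.layers.Concatenate()(conv_outputs + [rnn])
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```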

LLMs

Hyperparameter Value
Train:dev:test ratio 6:2:2
Batch size 64
Learning rate 2e-5
Weight decay 0.01
Epochs 50
Early stopping 5 epochs
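
A fine-tuning run with these hyperparameters can be sketched with the Hugging Face Trainer as below; the checkpoint name, output directory and the metric used for early stopping are placeholders and assumptions, not the repository's exact setup:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "bert-base-uncased"   # placeholder; any checkpoint from the tables below

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=50,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",  # assumption; could equally be F1
)

trainer = Trainer(
    model=model,
    args=args,
    # train_dataset=train_ds,           # tokenized splits from the 6:2:2 ratio go here
    # eval_dataset=dev_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
# trainer.train()
```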

Too-large LLMs


Some large LLMs cannot be trained with the hyperparameters in the LLMs section. To fit those models into the Kaggle GPU's memory, I reduced the batch size and learning rate to the following values:

Hyperparameter Value
Train:dev:test ratio 6:2:2
Batch size 32
Learning rate 1e-5
Weight decay 0.01
Epochs 50
Early stopping 5 epochs

All remaining hyperparameters stay the same as in the LLMs section.

Results

Experiment setup: All experiments were conducted under the same Kaggle environment:

Configuration Value
CPU Intel Xeon 2.20 GHz, 4 vCPU cores
Memory 32 GB
GPU NVIDIA Tesla T4 (x2) (LLMs) or P100 (RNNs, CNNs)
Random seed 42

wandb.ai report

Statistical models

Model Vectorizer Training configurations Public F1
K-Means TFIDF [1] 0.50658
Linear Models Logistic Regression TFIDF [1] 0.80171
Stochastic Gradient Descent TFIDF [1] 0.80386
Support Vector Machine TFIDF [1] 0.80140
Random Forest TFIDF [1] 0.78792
AdaBoost Decision Tree TFIDF [1] 0.72847
Bagging Decision Tree TFIDF [1] 0.74348
Decision Tree TFIDF [1] 0.71069
Gradient Boosting TFIDF [1] 0.73889
XGBoost TFIDF [1] 0.74992
Naive Bayes Multinomial Naive Bayes TFIDF [1] 0.80447
Complement Naive Bayes TFIDF [1] 0.79589
Multilayer Perceptrons TFIDF [1] 0.75911

Deep learning models

RNNs, CNNs and ensemble models
Model (with paper link) Pretrain parameters Training configurations Public F1 Notes
RNN 1-layer Bidirectional LSTM 714,369 [3] 0.77352
2-layers stacked Bidirectional LSTM 751,489 [3] 0.78026
1-layer Bidirectional GRU 698,241 [3] 0.77536
2-layers stacked Bidirectional GRU 725,249 [3] 0.77566
RNN + Attention 1-layer Bidirectional LSTM + Dot Attention 714,369 [3] 0.76892
1-layer Bidirectional GRU + Dot Attention 698,241 [3] 0.78516
1-layer Bidirectional LSTM + General Attention 730,881 [3] 0.77995
1-layer Bidirectional GRU + General Attention 714,753 [3] 0.77719
1-layer Bidirectional LSTM + Concatenate Attention 730,946 [3] 0.78148
1-layer Bidirectional GRU + Concatenate Attention 714,818 [3] 0.77873
Deep CNN (random + pretrained embedding) CNN non-static (random embedding) 299,629 [3] 0.71345 Embedding dimension = 25 (equal to the GloVe vector size)
CNN static (glove-twitter-25) 299,629 [3] 0.77689
CNN static (glove-twitter-50) 579,629 [3] 0.78700
CNN static (glove-twitter-100) 1,139,629 [3] 0.79374
CNN static (glove-twitter-200) 2,259,629 [3] 0.79711
CNN static (fasttext-wiki-news-subwords-300) 3,379,629 [3] 0.57033
CNN non-static (glove-twitter-25) 299,629 [3] 0.80478
CNN non-static (glove-twitter-50) 579,629 [3] 0.79619
CNN non-static (glove-twitter-100) 1,139,629 [3] 0.79987
CNN non-static (glove-twitter-200) 2,259,629 [3] 0.80140
CNN non-static (fasttext-wiki-news-subwords-300) 3,379,629 [3] 0.73980
Multi-channel CNN and RNN Random embedding (static) + Unidirectional LSTM 3,326,169 [3] 0.67391
Random embedding (static) + Bidirectional LSTM 4,411,609 [3] 0.68709
Random embedding (static) + Unidirectional GRU (todo) [3] (todo)
Random embedding (static) + Bidirectional GRU (todo) [3] (todo)
GloVe (glove-twitter-25, static) + Unidirectional LSTM 1,366,169 [3] 0.68372
GloVe (glove-twitter-25, static) + Bidirectional LSTM 2,451,609 [3] 0.78976
GloVe (glove-twitter-50, static) + Unidirectional LSTM 1,646,169 [3] 0.77781
GloVe (glove-twitter-50, static) + Bidirectional LSTM 2,731,609 [3] 0.78148
GloVe (glove-twitter-100, static) + Unidirectional LSTM 2,206,169 [3] 0.73460
GloVe (glove-twitter-100, static) + Bidirectional LSTM 3,291,609 [3] 0.78700
GloVe (glove-twitter-200, static) + Unidirectional LSTM 3,326,169 [3] 0.71835
GloVe (glove-twitter-200, static) + Bidirectional LSTM 4,411,609 [3] 0.76310
Random embedding (nonstatic) + Unidirectional LSTM 3,326,169 [3] 0.71284
Random embedding (nonstatic) + Bidirectional LSTM 4,411,609 [3] 0.75390
Random embedding (nonstatic) + Unidirectional GRU (todo) [3] (todo)
Random embedding (nonstatic) + Bidirectional GRU (todo) [3] (todo)
GloVe (glove-twitter-25, nonstatic) + Unidirectional LSTM 1,366,169 [3] 0.75942
GloVe (glove-twitter-25, nonstatic) + Bidirectional LSTM 2,451,609 [3] 0.79436
GloVe (glove-twitter-50, nonstatic) + Unidirectional LSTM 1,646,169 [3] 0.78240
GloVe (glove-twitter-50, nonstatic) + Bidirectional LSTM 2,731,609 [3] 0.79957
GloVe (glove-twitter-100, nonstatic) + Unidirectional LSTM 2,206,169 [3] 0.78700
GloVe (glove-twitter-100, nonstatic) + Bidirectional LSTM 3,291,609 [3] 0.76064
GloVe (glove-twitter-200, nonstatic) + Unidirectional LSTM 3,326,169 [3] 0.78179
GloVe (glove-twitter-200, nonstatic) + Bidirectional LSTM 4,411,609 [3] 0.77474
Multi-channel CNN and RNN (concat) Random embedding (static) + Unidirectional LSTM 3,772,121 [3] 0.78394 Embedding dimension = 200
Random embedding (static) + Bidirectional LSTM 5,265,113 [3] 0.78700
Random embedding (static) + Unidirectional GRU 3,408,601 [3] 0.78302
Random embedding (static) + Bidirectional GRU 4,538,073 [3] 0.77627
GloVe (glove-twitter-25, static) + Unidirectional LSTM 1,453,721 [3] 0.80110
GloVe (glove-twitter-25, static) + Bidirectional LSTM 2,588,313 [3] 0.79436
GloVe (glove-twitter-25, static) + Unidirectional GRU 1,179,801 [3] 0.80294
GloVe (glove-twitter-25, static) + Bidirectional GRU 2,040,473 [3] 0.79528
GloVe (glove-twitter-50, static) + Unidirectional LSTM 1,784,921 [3] 0.81091
GloVe (glove-twitter-50, static) + Bidirectional LSTM 2,970,713 [3] 0.81366
GloVe (glove-twitter-50, static) + Unidirectional GRU 1,498,201 [3] 0.80907
GloVe (glove-twitter-50, static) + Bidirectional GRU 2,397,273 [3] 0.80937
GloVe (glove-twitter-100, static) + Unidirectional LSTM 2,447,321 [3] 0.80539
GloVe (glove-twitter-100, static) + Bidirectional LSTM 3,735,513 [3] 0.81305
GloVe (glove-twitter-100, static) + Unidirectional GRU (todo) [3] (todo)
GloVe (glove-twitter-100, static) + Bidirectional GRU 3,110,873 [3] 0.80907
GloVe (glove-twitter-200, static) + Unidirectional LSTM 3,772,121 [3] 0.80723
GloVe (glove-twitter-200, static) + Bidirectional LSTM 5,265,113 [3] 0.81152
GloVe (glove-twitter-200, static) + Unidirectional GRU 3,408,601 [3] 3,408,601
GloVe (glove-twitter-200, static) + Bidirectional GRU 4,538,073 [3] 0.80815
Random embedding (nonstatic) + Unidirectional LSTM 3,772,121 [3] 0.74164
Random embedding (nonstatic) + Bidirectional LSTM 5,265,113 [3] 0.77444
Random embedding (nonstatic) + Unidirectional GRU 3,408,601 [3] 0.80171
Random embedding (nonstatic) + Bidirectional GRU 4,538,073 [3] 0.80049
GloVe (glove-twitter-25, nonstatic) + Unidirectional LSTM 1,453,721 [3] 0.80876
GloVe (glove-twitter-25, nonstatic) + Bidirectional LSTM 2,588,313 [3] 0.79834
GloVe (glove-twitter-25, nonstatic) + Unidirectional GRU 1,179,801 [3] 0.80815
GloVe (glove-twitter-25, nonstatic) + Bidirectional GRU 2,040,473 [3] 0.79650
GloVe (glove-twitter-50, nonstatic) + Unidirectional LSTM 1,784,921 [3] 0.80539
GloVe (glove-twitter-50, nonstatic) + Bidirectional LSTM 2,970,713 [3] 0.81213
GloVe (glove-twitter-50, nonstatic) + Unidirectional GRU 1,498,201 [3] 0.80968
GloVe (glove-twitter-50, nonstatic) + Bidirectional GRU 2,397,273 [3] 0.80386
GloVe (glove-twitter-100, nonstatic) + Unidirectional LSTM 2,447,321 [3] 0.81029
GloVe (glove-twitter-100, nonstatic) + Bidirectional LSTM 3,735,513 [3] 0.80968
GloVe (glove-twitter-100, nonstatic) + Unidirectional GRU 2,135,001 [3] 0.80570
GloVe (glove-twitter-100, nonstatic) + Bidirectional GRU 3,110,873 [3] 0.80815
GloVe (glove-twitter-200, nonstatic) + Unidirectional LSTM 3,772,121 [3] 0.80508
GloVe (glove-twitter-200, nonstatic) + Bidirectional LSTM 5,265,113 [3] 0.81182
GloVe (glove-twitter-200, nonstatic) + Unidirectional GRU 3,408,601 [3] 0.81244
GloVe (glove-twitter-200, nonstatic) + Bidirectional GRU 4,538,073 [3] 0.80999
LLMs
Model (with paper link) Pretrain parameters Training configurations Public F1 Notes
ALBERT base-v1 11M (huggingface) [2] 0.80907 View list of parameters by huggingface here
large-v1 17M (huggingface) [2] 0.80416
xlarge-v1 58M (huggingface) [4] 0.81182
xxlarge-v1 223M (huggingface) [4] 0.78853
base-v2 11M (huggingface) [2] 0.79528
large-v2 17M (huggingface) [2] 0.81520
xlarge-v2 58M (huggingface) [4] 0.81703
xxlarge-v2 223M (huggingface) [4] 0.80570
BART base 140M (facebook-research) [2] 0.82684 View list of parameters by facebook-research here
large 400M (facebook-research) [2] 0.83726
large-mnli 400M (facebook-research) [2] 0.83450
large-cnn 400M (facebook-research) [2] 0.82347
BERT base uncased 110M (huggingface) [2] 0.82899 View list of parameters by huggingface here
base cased 110M (huggingface) [2] 0.81060
large uncased 340M (huggingface) [2] 0.83052
large cased 340M (huggingface) [2] 0.82194
large uncased whole word masking 335M (huggingface) [2] 0.82255
large cased whole word masking 336M (huggingface) [2] 0.81244
multilingual uncased 168M (huggingface) [2] 0.81887
multilingual cased 179M (huggingface) [2] 0.81918
BERTweet base 135M (vinai) [2] 0.83726 View list of parameters by vinai here
covid19-base-uncased 135M (vinai) [2] 0.84002
covid19-base-cased 135M (vinai) [2] 0.82960
large 335M (vinai) [2] 0.82899
BORT base 56.1M (amazon) [2] 0.74563 Parameters from the original paper
DeBERTa base 100M (microsoft) [2] 0.81642 View list of parameters by microsoft here
base-mnli 86M (microsoft) [2] 0.80661
large 350M (microsoft) [4] 0.84308
large-mnli 350M (microsoft) [4] 0.83757
DeBERTa v3 xsmall 22M (microsoft) [2] 0.80815 View list of parameters by microsoft here
small 44M (microsoft) [2] 0.82408
base 86M (microsoft) [2] 0.83205
large 304M (microsoft) [4] 0.82745
mdeberta-v3-base 86M (microsoft) [2] 0.82929
DistilBERT base uncased 66M (huggingface) [2] 0.82439 View list of parameters by huggingface here
base cased 65M (huggingface) [2] 0.82163
multilingual cased 134M (huggingface) [2] 0.80049
ELECTRA (discriminator) small 14M (google) [2] 0.81887 View list of parameters by google here
base 110M (google) [2] 0.82776
large 335M (google) [2] 0.83726
RoBERTa base 125M (huggingface) [2] 0.82868 View list of parameters by huggingface here
large 355M (huggingface) [2] 0.84033
distilroberta-large 82M (huggingface) [2] 0.82960
SqueezeBERT uncased 51M (huggingface) [2] 0.80324 View list of parameters by huggingface here
mnli 51M (huggingface) [2] 0.79987
mnli-headless 51M (huggingface) [2] 0.80416
Twitter RoBERTa Sentiment base N/A [2] 0.83389 CardiffNLP has a huge list of Twitter pretrained models and these are just 3 of them. Try finetuning others (if you have time).
base latest N/A [2] 0.82776
base 2021 124M (cardiffnlp) [2] 0.83083
XLM-RoBERTa base 270M (huggingface) [2] 0.82439 View list of parameters by huggingface here
large 550M (huggingface) [2] 0.82500
XLNet base cased 110M (huggingface) [2] 0.82592 View list of parameters by huggingface here
large cased 340M (huggingface) [4] 0.81612