Comparison of Word Embeddings and Sentence Encodings as Generalized Representations for Crisis Tweet Classification Tasks

This repo contains the Python 3.7 scripts for the paper "Comparison of Word Embeddings and Sentence Encodings as Generalized Representations for Crisis Tweet Classification Tasks".

Requirements

The word embedding experiments require numpy, pandas, scikit-learn, and gensim. The sentence encoding experiments need additional packages: SIF requires theano and lasagne, InferSent requires pytorch and nltk, and the Universal Sentence Encoder from TensorFlow (tfSent) requires tensorflow.
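Before running an experiment, it can help to check which of these packages are installed. The following is a minimal sketch, not part of the repo: the function name and the grouping dictionary are hypothetical, and the import names assume the usual mapping (scikit-learn is imported as sklearn, pytorch as torch).

```python
from importlib.util import find_spec

# Import names grouped by experiment, taken from the requirements above.
# Assumption: scikit-learn is imported as "sklearn" and pytorch as "torch".
REQUIREMENTS = {
    "word embeddings": ["numpy", "pandas", "sklearn", "gensim"],
    "SIF": ["theano", "lasagne"],
    "InferSent": ["torch", "nltk"],
    "tfSent": ["tensorflow"],
}


def missing_packages(requirements=REQUIREMENTS):
    """Return {experiment: [modules that cannot be found]}."""
    return {
        name: [m for m in modules if find_spec(m) is None]
        for name, modules in requirements.items()
    }


if __name__ == "__main__":
    for name, missing in missing_packages().items():
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{name}: {status}")
```

An empty list for an experiment means all of its packages were found. The check does not verify versions.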

Data

Preprocessed data files are in the "data" directory.

Get started

To run the word embedding experiments:

Download the word embedding files at link to the "data" directory.
Run wordembs_supervised_clfs.py:

python wordembs_supervised_clfs.py -h

usage: wordembs_supervised_clfs.py [-h] [-d {t6,t26,2C}]
                                   [-c {BernoulliNB,GaussianNB,RF,SVM,KNN}]
                                   [-bow {binary,embedding}]
                                   [-e {Glove,crisisGlove,Word2Vec,crisisWord2Vec,FastText,crisisFastText}]
                                   [-a {mean,tfidf,minmaxmean}]

Example:

python wordembs_supervised_clfs.py -d t26 -c GaussianNB -bow embedding -e Glove -a mean
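The -a option selects how the word vectors of a tweet are combined into a single feature vector. The sketch below illustrates two of the listed choices under stated assumptions: "mean" is taken to be the element-wise average over in-vocabulary tokens, and "minmaxmean" is assumed to be the concatenation of element-wise min, max, and mean. The toy embeddings and the function name are hypothetical and are not taken from MeanEmbedding.py.

```python
import numpy as np

# Toy 3-dimensional embeddings standing in for GloVe vectors (made-up values).
TOY_EMBEDDINGS = {
    "flood": np.array([1.0, 0.0, 2.0]),
    "water": np.array([3.0, 2.0, 0.0]),
}
DIM = 3


def tweet_vector(tokens, embeddings=TOY_EMBEDDINGS, how="mean"):
    """Aggregate the vectors of in-vocabulary tokens into one tweet vector."""
    if how not in ("mean", "minmaxmean"):
        raise ValueError(f"unsupported aggregation: {how}")
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known token: fall back to zeros (an assumption of this sketch).
        return np.zeros(DIM if how == "mean" else 3 * DIM)
    stacked = np.vstack(vectors)
    mean = stacked.mean(axis=0)
    if how == "mean":
        return mean
    # Assumed meaning of "minmaxmean": concatenate min, max, and mean.
    return np.concatenate([stacked.min(axis=0), stacked.max(axis=0), mean])


vec = tweet_vector(["flood", "water", "unknown"])
```

Out-of-vocabulary tokens such as "unknown" are skipped here; the scripts in this repo may handle them differently.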

NOTE: The BernoulliNB classifier only makes sense with binary bag-of-words representations.
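The reason is that a Bernoulli model treats every feature as present or absent. In scikit-learn, BernoulliNB binarizes its input at a threshold of 0.0 by default, so dense embedding features would be reduced to their sign. The numpy-only sketch below shows a binary bag-of-words matrix and what such thresholding does to a dense vector; the function name and all values are made up for illustration.

```python
import numpy as np


def binary_bow(tokenized_tweets):
    """Build a 0/1 document-term matrix: 1 if the word occurs in the tweet."""
    vocab = sorted({tok for tweet in tokenized_tweets for tok in tweet})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = np.zeros((len(tokenized_tweets), len(vocab)), dtype=int)
    for row, tweet in enumerate(tokenized_tweets):
        for tok in tweet:
            matrix[row, index[tok]] = 1  # presence only, counts are ignored
    return vocab, matrix


tweets = [["flood", "water", "water"], ["fire", "smoke"]]
vocab, X = binary_bow(tweets)

# A dense embedding-style vector thresholded at 0 keeps only the sign,
# which is all a Bernoulli model with a 0.0 threshold would see.
dense = np.array([0.12, -0.40, 0.03, -0.01])
binarized = (dense > 0.0).astype(int)
```

For embedding features, GaussianNB is the naive Bayes variant offered by the scripts that models real-valued inputs.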

Sentence Encodings related

To run the sentence encoding experiments:

SIF:

cd SIF_sentence
python SIF_sentence.py -h

Example:

python SIF_sentence.py -d t26 -c GaussianNB

InferSent:

cd InferSent

Download GloVe (V1):

mkdir GloVe
curl -Lo GloVe/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip GloVe/glove.840B.300d.zip -d GloVe/

Download the pretrained InferSent model (V1, trained with GloVe):

mkdir encoder
curl -Lo encoder/infersent1.pkl https://dl.fbaipublicfiles.com/infersent/infersent1.pkl

Note that infersent1 is trained with GloVe (which has been trained on text preprocessed with the PTB tokenizer).

Make sure you have the NLTK tokenizer by running the following once:

import nltk
nltk.download('punkt')

Run inferSent_crisis_LOO.py:

python inferSent_crisis_LOO.py -h
usage: inferSent_crisis_LOO.py [-h] [-d {t6,t26,2C}]
                               [-c {GaussianNB,RF,SVM,KNN}]

Example:

python inferSent_crisis_LOO.py -d t26 -c RF
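The LOO suffix in the script name suggests leave-one-out evaluation. Assuming this means holding out one crisis event at a time (an assumption; the script itself is the reference), the split logic can be sketched as follows. The function name, event names, and records are hypothetical.

```python
def leave_one_event_out(records):
    """Yield (held_out_event, train, test) with one event held out per fold.

    `records` is a list of (event, text, label) tuples.
    """
    events = sorted({event for event, _, _ in records})
    for held_out in events:
        train = [r for r in records if r[0] != held_out]
        test = [r for r in records if r[0] == held_out]
        yield held_out, train, test


# Made-up records for illustration only.
records = [
    ("event_a", "bridge closed by flooding", "infrastructure"),
    ("event_a", "stay safe everyone", "other"),
    ("event_b", "shelter open downtown", "shelter"),
]
folds = list(leave_one_event_out(records))
```

Each fold trains on all events except one and tests on the held-out event, which measures how well a representation transfers to an unseen crisis.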

NOTE: The original paper used an older version of the InferSent model pretrained with GloVe, so the results are slightly different. Interested users can also try the V2 model, which is trained with fastText. For details, see InferSent.

tfSent

Run tf_sentence.py:

python tf_sentence.py -h
usage: tf_sentence.py [-h] [-d {t6,t26,2C}] [-c {GaussianNB,RF,SVM,KNN}]

Example:

python tf_sentence.py -d t26 -c RF
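For readers who want to wrap these scripts, the usage line above maps onto a small argparse parser. This is a sketch that only mirrors the flags and choices shown in the usage output; the dest names, help strings, and defaults are assumptions, not the script's actual code.

```python
import argparse


def build_parser():
    """Build a parser that accepts the options shown in the usage line."""
    parser = argparse.ArgumentParser(prog="tf_sentence.py")
    # dest names, help strings, and defaults are assumptions of this sketch.
    parser.add_argument("-d", dest="dataset", choices=["t6", "t26", "2C"],
                        default="t26", help="dataset to use")
    parser.add_argument("-c", dest="classifier",
                        choices=["GaussianNB", "RF", "SVM", "KNN"],
                        default="RF", help="classifier to train")
    return parser


# Parse the example command line from above.
args = build_parser().parse_args(["-d", "t26", "-c", "RF"])
```

Because the choices are restricted, a value outside the listed sets (for example -c BernoulliNB) is rejected by the parser with a usage error.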

References

For more details and full experimental results, see the paper.

@inproceedings{li2018comparison,
  title={Comparison of Word Embeddings and Sentence Encodings as Generalized Representations for Crisis Tweet Classification Tasks},
  author={Li, Hongmin and Li, Xukun and Caragea, Doina and Caragea, Cornelia},
  booktitle={Proceedings of the ISCRAM Asian Pacific 2018 Conference – Wellington, New Zealand, November 2018},
  year={2018}
}
