Loïc Ottet, Baptiste Raemy, Alexis Semple
In folder `twitter-datasets/`: `pos_train.txt`, `neg_train.txt`
The steps described below are analogous for the full dataset (`pos_train_full.txt` and `neg_train_full.txt`). All scripts act on the reduced datasets by default, but they can take `full` as their last argument to act on the full dataset.
- Build the vocabulary (`build_vocab.sh`). Yields `vocab.txt`
- Cut the vocabulary (`cut_vocab.sh`). Yields `vocab_cut.txt`
- Convert the vocabulary to a Python dictionary mapping words to ids (`pickle_vocab.py`). Yields `vocab.pkl`
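The word-to-id step above could look roughly like the sketch below; it assumes `vocab_cut.txt` holds one word per line, with the line number serving as the word's id. File names follow the list above, but the exact details of `pickle_vocab.py` may differ.

```python
import pickle

def pickle_vocab(cut_path="vocab_cut.txt", out_path="vocab.pkl"):
    # Map each word to its line index in the cut vocabulary file.
    vocab = {}
    with open(cut_path) as f:
        for idx, line in enumerate(f):
            vocab[line.strip()] = idx
    # Persist the mapping so later scripts can load it with pickle.load.
    with open(out_path, "wb") as f:
        pickle.dump(vocab, f)
    return vocab
```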
- Compute the cooccurrence matrix (`cooc.py`). Yields `cooc.pkl`
- Compute the GloVe matrix (`glove_solution.py`). Yields `embeddings_glove-basic.npy`
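As a rough illustration of the cooccurrence step, the sketch below counts how often two vocabulary words appear in the same tweet. This is only a minimal sketch: the real `cooc.py` may use a different context window or weighting, and the function name here is hypothetical.

```python
from collections import Counter

def build_cooc(tweets, vocab):
    # Count word-pair cooccurrences, treating each whole tweet as the
    # context window; pairs of identical ids are skipped.
    counts = Counter()
    for tweet in tweets:
        ids = [vocab[w] for w in tweet.split() if w in vocab]
        for i in ids:
            for j in ids:
                if i != j:
                    counts[(i, j)] += 1
    return counts
```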
In what follows, `**` represents the number of features of the word vectors (25, 50, 100 or 200).
- The vectors are in `twitter-datasets/glove.twitter.27B.**d.txt`
- Compute the word embedding for a given dimension (`filterVocab.py **`). Yields `embeddings_glove**.npy`
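Filtering the pretrained vectors down to our own vocabulary could be sketched as below, assuming the usual GloVe text format of one `word v1 v2 ...` entry per line and a word-to-id dictionary as built earlier. The function name and the choice of zero vectors for out-of-file words are assumptions, not necessarily what `filterVocab.py` does.

```python
import numpy as np

def filter_pretrained(glove_path, vocab):
    # Build a (len(vocab), dim) matrix; rows of words missing from the
    # pretrained file stay zero.
    emb = None
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], np.array(parts[1:], dtype=float)
            if emb is None:
                emb = np.zeros((len(vocab), len(vec)))
            if word in vocab:
                emb[vocab[word]] = vec
    return emb
```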
- Compute the fastText vectors (`./fasttext skipgram -input data.txt -output model`). Yields `model.vec` (`data.txt` is a concatenation of the positive and negative train sets)
- Compute the word embedding for our vocabulary (`filterVocabFastText.py`). Yields `embeddings_fasttext.npy`
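The fastText filtering step differs from the GloVe one mainly in the file format: a `.vec` file starts with a header line containing the word count and dimension. A hedged sketch (function name assumed, actual `filterVocabFastText.py` may differ):

```python
import numpy as np

def filter_fasttext(vec_path, vocab):
    with open(vec_path, encoding="utf-8") as f:
        # .vec files begin with a "<word count> <dimension>" header line.
        _, dim = map(int, f.readline().split())
        emb = np.zeros((len(vocab), dim))
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] in vocab:
                emb[vocab[parts[0]]] = np.array(parts[1:], dtype=float)
    return emb
```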
To apply `trainTensorflow.py` and `predic.py` to the full dataset, use the `--full` flag.
- Load and pad the training data (`loadData.py`). Yields `x_train_padded.npy` and `y_train.npy`
- Train the neural network (`trainTensorflow.py --embeddings=***`), where `***` is the name of the chosen embedding (`glove-basic`, `glove**` or `fasttext`). Yields detailed run data in `runs/***_****/`, where `****` is the timestamp of the run
- Generate predictions from the test set (`predic.py --name=***_****`). Yields `predictions.csv`
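The padding in the first step above can be sketched as follows: tweets of differing lengths are right-padded (or truncated) to a common length so they stack into one array. The function name, pad id of 0, and truncation behaviour are assumptions, not necessarily what `loadData.py` does.

```python
def pad_sequences(seqs, maxlen=None, pad_id=0):
    # Default to the longest sequence; truncate longer ones, pad shorter
    # ones on the right with pad_id.
    maxlen = maxlen or max(len(s) for s in seqs)
    return [s[:maxlen] + [pad_id] * (maxlen - len(s)) for s in seqs]
```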