Sentiment analysis using a simple LSTM network to classify short texts into two categories, positive and negative. The implemented LSTM network is structured as follows (the batch dimension is omitted for clarity):
- Embedding layer: Transforms each input (a tensor of k words) into a tensor of k N-dimensional vectors (word embeddings), where N is the embedding size. Each word is associated with a vector of weights that is learnt during training. You can gain more insight into word embeddings at Vector Representations of Words.
- RNN layer: Composed of LSTM cells with a dropout wrapper. The intuition behind LSTM networks is nicely described in Understanding LSTM Networks. The LSTM weights are learnt during training. The RNN layer is unrolled dynamically, taking k word embeddings as input and outputting k M-dimensional vectors, where M is the hidden size of the LSTM cells.
- Softmax layer: The RNN-layer output is averaged across the k timesteps, yielding a single tensor of size M. Finally, a softmax layer is used to compute the classification probabilities.
Cross-entropy is used as the loss function and RMSProp is the optimizer that minimizes it.
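The graph described above might be sketched in TensorFlow 1.x roughly as follows. This is a minimal illustrative sketch, not the repository's actual code: the sizes (`vocab_size`, `k`, etc.) and variable names are placeholders.

```python
import tensorflow as tf

# Illustrative sizes: k = sequence length, N = embedding size, M = hidden size.
vocab_size, k, N, M, n_classes = 20000, 40, 75, 75, 2

words = tf.placeholder(tf.int32, [None, k])             # word-id sequences
labels = tf.placeholder(tf.float32, [None, n_classes])  # one-hot targets
keep_prob = tf.placeholder(tf.float32)                  # dropout keep-probability

# Embedding layer: maps each of the k word ids to an N-dimensional vector.
embeddings = tf.Variable(tf.random_uniform([vocab_size, N], -1.0, 1.0))
embedded = tf.nn.embedding_lookup(embeddings, words)    # [batch, k, N]

# RNN layer: an LSTM cell with a dropout wrapper, unrolled dynamically.
cell = tf.nn.rnn_cell.DropoutWrapper(
    tf.nn.rnn_cell.BasicLSTMCell(M), output_keep_prob=keep_prob)
outputs, _ = tf.nn.dynamic_rnn(cell, embedded, dtype=tf.float32)  # [batch, k, M]

# Average across the k timesteps and apply a softmax layer on top.
mean_output = tf.reduce_mean(outputs, axis=1)           # [batch, M]
W = tf.Variable(tf.truncated_normal([M, n_classes], stddev=0.1))
b = tf.Variable(tf.constant(0.1, shape=[n_classes]))
logits = tf.matmul(mean_output, W) + b
probabilities = tf.nn.softmax(logits)

# Cross-entropy loss, minimized with RMSProp.
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
train_op = tf.train.RMSPropOptimizer(learning_rate=0.01).minimize(loss)
```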
TensorBoard provides a nice overview of the whole graph.
- Python 3.5
- Pip 9.0.1
- Install TensorFlow (see the TensorFlow installation guide)
- Run:

```
sudo pip install -r requirements.txt
```
To train a model, run:

```
python train.py
```

Optional flags (an example invocation follows the list):
- `--data_dir`: Data directory containing `data.csv` (must have at least the columns 'SentimentText' and 'Sentiment'). Intermediate files will automatically be stored here. Default is `data/Kaggle`.
- `--stopwords_file`: Path to the stopwords file. If `stopwords_file=None`, no stopwords will be used. Default is `data/stopwords.txt`.
- `--n_samples`: Number of samples to use from the dataset. Set `n_samples=None` to use the whole dataset. Default is `None`.
- `--checkpoints_root`: Checkpoints directory root. Parameters will be saved there. Default is `checkpoints`.
- `--summaries_dir`: Directory where TensorFlow summaries will be stored. You can visualize learning using TensorBoard by running `tensorboard --logdir=<summaries_dir>`. Default is `logs`.
- `--batch_size`: Batch size. Default is `100`.
- `--train_steps`: Number of training steps. Default is `300`.
- `--hidden_size`: Hidden size of the LSTM layer. Default is `75`.
- `--embedding_size`: Size of the embedding layer. Default is `75`.
- `--random_state`: Random state used for data splitting. Default is `0`.
- `--learning_rate`: RMSProp learning rate. Default is `0.01`.
- `--test_size`: Proportion of the dataset to be included in the test split (`0 < test_size < 1`). Default is `0.2`.
- `--dropout_keep_prob`: Dropout keep probability (`0 < dropout_keep_prob <= 1`). Default is `0.5`.
- `--sequence_len`: Maximum sequence length. Let m be the maximum sequence length in the dataset; then `sequence_len >= m` is required. If `sequence_len=None`, it will automatically be set to m. Default is `None`.
- `--validate_every`: Step frequency at which the model is evaluated on a validation set. Default is `100`.
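For instance, a run that overrides a few of these defaults might look like this (the chosen values are purely illustrative, not recommendations):

```
python train.py --data_dir data/Kaggle --train_steps 1000 --hidden_size 100 --dropout_keep_prob 0.75
```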
After training the model, the checkpoints directory will be printed out. For example: `Model saved in: checkpoints/1481294288`

To make predictions using a previously trained model, run:

```
python predict.py --checkpoints_dir <checkpoints directory>
```

For example:

```
python predict.py --checkpoints_dir checkpoints/1481294288
```
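For reference, restoring from such a checkpoint directory in plain TensorFlow 1.x looks roughly like the sketch below. `predict.py` wraps this up for you; the exact checkpoint file layout here is an assumption.

```python
import tensorflow as tf

# Hypothetical restore sketch; predict.py handles this internally.
checkpoints_dir = 'checkpoints/1481294288'  # directory printed after training

with tf.Session() as sess:
    ckpt = tf.train.latest_checkpoint(checkpoints_dir)  # newest checkpoint
    saver = tf.train.import_meta_graph(ckpt + '.meta')  # rebuild the graph
    saver.restore(sess, ckpt)                           # load the weights
    # ...feed new samples through the restored graph to obtain probabilities.
```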