# Using NMT to process Twitter data
I developed an app that listens to Twitter feeds and saves each response, along with the original Tweet, to a SQLite database.  I then used that data as input for the TensorFlow NMT tutorial:  https://github.com/tensorflow/nmt.  The data needed a few cleanup steps before it would load, but it seems to be working now.

## Set environment variables and save to a script
This is the only way I could find to make the variables available in each cell.  I write a file called env.sh and source it at the top of every subsequent bash cell.



In [58]:
%%writefile env.sh
export HOME_DIR=`pwd`
export DATA_DIR=$HOME_DIR/data_files
export MODEL_DIR=$HOME_DIR/nmt_model
export NMT_DIR=$HOME_DIR/nmt

Overwriting env.sh
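
A possible alternative, sketched here as an assumption rather than something this notebook relies on: `%%bash` cells run as child processes of the kernel, so variables placed in `os.environ` from a Python cell should be inherited by them without sourcing env.sh. The helper name `build_env` is hypothetical; it only mirrors the four variables exported above.

```python
import os
from pathlib import Path


def build_env(home_dir):
    """Return the same four variables that env.sh exports (hypothetical helper)."""
    home = Path(home_dir)
    return {
        "HOME_DIR": str(home),
        "DATA_DIR": str(home / "data_files"),
        "MODEL_DIR": str(home / "nmt_model"),
        "NMT_DIR": str(home / "nmt"),
    }


# Set the variables for this kernel process; child processes inherit them.
os.environ.update(build_env(Path.cwd()))
```

Unlike env.sh, which re-evaluates `pwd` every time it is sourced, this fixes HOME_DIR once, at the moment the cell runs.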


## Delete old directories if they exist
In later cells, we're going to recreate these.

In [61]:
%%bash
. ./env.sh
if [ -d "$DATA_DIR" ]; then
    echo "removing $DATA_DIR"
    rm -r "$DATA_DIR"
fi

if [ -d "$MODEL_DIR" ]; then
    echo "removing $MODEL_DIR"
    rm -r "$MODEL_DIR"
fi



removing /home/rich/src/concretely/concretely_nmt/data_files
removing /home/rich/src/concretely/concretely_nmt/nmt_model
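
The same cleanup could be done from a Python cell. This is a minimal sketch, assuming the directory paths are available as strings; `remove_if_exists` is a hypothetical helper name, not part of the project.

```python
import shutil
from pathlib import Path


def remove_if_exists(path):
    """Delete a directory tree if present; return True when something was removed."""
    target = Path(path)
    if target.is_dir():
        print(f"removing {target}")
        shutil.rmtree(target)
        return True
    return False
```

Calling it twice on the same path is safe: the second call finds nothing and returns False.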


## Check out the nmt project from GitHub
This cell also recreates the data and model directories.  The clone only happens if the nmt directory doesn't exist yet.

In [48]:
%%bash
. ./env.sh
mkdir -p "$DATA_DIR"
mkdir -p "$MODEL_DIR"

if [ ! -d "$NMT_DIR" ]; then
  git clone https://github.com/tensorflow/nmt/ "$NMT_DIR"
fi

## Process the database
This program reads the SQLite tables, extracts the Tweet text from the Tweet JSON, and does a little cleanup.  It also splits the data into train, eval, and test sets (the tst2012 files are used as the eval set and the tst2013 files as the test set).  Right now the input and output folders are hardcoded, but I plan on fixing that.

In [49]:
%%bash
. ./env.sh
python $HOME_DIR/process_db.py
ls $DATA_DIR

input.txt
output.txt
train.from
train.to
tst2012.from
tst2012.to
tst2013.from
tst2013.to
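
process_db.py itself is not shown here, so the following is only a sketch of the kind of processing described above, not the actual script. Everything schema-related is an assumption: the table name `tweets`, the columns `original_json` and `response_json`, the `"text"` field inside each Tweet object, the split fractions, and the choice to write the original Tweet to `.from` and the response to `.to`.

```python
import json
import random
import re
import sqlite3


def clean_text(text):
    """Collapse all whitespace (including newlines) so each Tweet fits on one line."""
    return re.sub(r"\s+", " ", text).strip()


def load_pairs(conn, table="tweets"):
    """Yield (original, response) text pairs.

    Assumes a hypothetical table with two JSON columns, original_json and
    response_json, each holding a Tweet object with a "text" field.
    """
    query = f"SELECT original_json, response_json FROM {table}"
    for original_json, response_json in conn.execute(query):
        original = clean_text(json.loads(original_json).get("text", ""))
        response = clean_text(json.loads(response_json).get("text", ""))
        if original and response:  # skip pairs with an empty side
            yield original, response


def split_pairs(pairs, eval_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the pairs and split them into train, eval and test lists."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_eval = int(len(pairs) * eval_frac)
    n_test = int(len(pairs) * test_frac)
    return {
        "train": pairs[n_eval + n_test:],
        "tst2012": pairs[:n_eval],
        "tst2013": pairs[n_eval:n_eval + n_test],
    }


def write_split(name, pairs, data_dir):
    """Write parallel <name>.from and <name>.to files, one Tweet per line."""
    with open(f"{data_dir}/{name}.from", "w", encoding="utf-8") as src, \
         open(f"{data_dir}/{name}.to", "w", encoding="utf-8") as tgt:
        for original, response in pairs:
            src.write(original + "\n")
            tgt.write(response + "\n")
```

Collapsing newlines matters because the `.from` and `.to` files are line-aligned: a Tweet that spans two lines would shift every pair after it.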


## Generate vocab files
This step generates vocabulary files for the input and output files.  I was getting duplicates and blank lines in the output, so for now I pipe the results through sed, sort, and uniq.  I plan on either updating generate_vocab.py or merging it with process_db.py to get rid of all this sed stuff.

In [50]:
%%bash
. ./env.sh
python generate_vocab.py --max_vocab_size=10000 --downcase=True "$DATA_DIR/train.from" \
    | sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' \
    | sort | uniq \
    | sed '/^[[:space:]]*$/d' > "$DATA_DIR/vocab.from"
python generate_vocab.py --max_vocab_size=10000 --downcase=True "$DATA_DIR/train.to" \
    | sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' \
    | sort | uniq \
    | sed '/^[[:space:]]*$/d' > "$DATA_DIR/vocab.to"
ls $DATA_DIR

input.txt
output.txt
train.from
train.to
tst2012.from
tst2012.to
tst2013.from
tst2013.to
vocab.from
vocab.to
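
The sed, sort, and uniq steps could be folded into Python, which is roughly what merging them into generate_vocab.py or process_db.py would involve. The sketch below is an assumption about how that might look, not the existing code: `build_vocab` is a hypothetical function, it reads raw text lines rather than the output of generate_vocab.py, it only strips ASCII punctuation, and it does not cap the vocabulary size the way `--max_vocab_size` does.

```python
import string


def build_vocab(lines, downcase=True):
    """Return sorted, unique tokens with ASCII punctuation removed."""
    strip_punct = str.maketrans("", "", string.punctuation)
    tokens = set()
    for line in lines:
        if downcase:
            line = line.lower()
        # str.split() with no arguments drops empty strings, so no blank
        # entries reach the vocabulary.
        tokens.update(line.translate(strip_punct).split())
    return sorted(tokens)
```

Using a set removes duplicates in the same pass, so no separate deduplication or blank-line filtering is needed. The ordering may differ from the shell `sort`, which depends on the locale.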


## Run the model
This is the command that actually runs the model.  Note that num_train_steps=5, which is just enough to verify that the code runs.  It was originally set to 12000; like the other parameters, it can be modified.

In [53]:
%%bash
. ./env.sh
cd $NMT_DIR

python -m nmt.nmt \
    --src=from --tgt=to \
    --vocab_prefix=$DATA_DIR/vocab  \
    --train_prefix=$DATA_DIR/train \
    --dev_prefix=$DATA_DIR/tst2012  \
    --test_prefix=$DATA_DIR/tst2013 \
    --out_dir=$MODEL_DIR \
    --num_train_steps=5 \
    --steps_per_stats=100 \
    --num_layers=2 \
    --num_units=128 \
    --dropout=0.2 \
    --metrics=bleu

# Job id 0
# Devices visible to TensorFlow: [_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 726199474441383584), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 3929896018156925669)]
# Vocab file /home/rich/src/concretely/concretely_nmt/data_files/vocab.from exists
The first 3 vocab words [ツ, 🤬, 🤣🤣] are not [<unk>, <s>, </s>]
# Vocab file /home/rich/src/concretely/concretely_nmt/data_files/vocab.to exists
The first 3 vocab words [🤣, 🤣🤣, 0033] are not [<unk>, <s>, </s>]
  saving hparams to /home/rich/src/concretely/concretely_nmt/nmt_model/hparams
  saving hparams to /home/rich/src/concretely/concretely_nmt/nmt_model/best_bleu/hparams
  attention=
  attention_architecture=standard
  avg_ckpts=False
  batch_size=128
  beam_width=0
  best_bleu=0
  best_bleu_dir=/home/rich/src/concretely/concretely_nmt/nmt_model/best_bleu
  check_special_token=True
  colocate_gradients_with_ops=True
  decay_scheme=
  dev_prefix=/hom

2019-01-09 01:10:13.754518: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
  from ._conv import register_converters as _register_converters
Instructions for updating:
Use `tf.data.experimental.group_by_window(...)`.
Instructions for updating:
This class is deprecated, please use tf.nn.rnn_cell.LSTMCell, which supports all the feature this cell currently has. Please replace the existing code with tf.nn.rnn_cell.LSTMCell(name='basic_lstm_cell').
2019-01-09 01:10:17.887626: I tensorflow/core/kernels/lookup_util.cc:376] Table trying to initialize from file /home/rich/src/concretely/concretely_nmt/nmt_model/vocab.to is already initialized.
2019-01-09 01:10:17.887626: I tensorflow/core/kernels/lookup_util.cc:376] Table trying to initialize from file /home/rich/src/concretely/concretely_nmt/nmt_model/vocab.from is already initialized.
2019-01-09 01:10:17.887626: I tensorflow/core/kernels/lookup_
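
The "first 3 vocab words" messages in the log above show that the generated vocab files do not start with the `<unk>`, `<s>`, `</s>` tokens the tutorial code checks for. The run continues regardless, and the later log lines refer to vocab files under the model directory instead. To write vocab files that already pass that check, one option is to prepend the tokens before saving. This is a minimal sketch; `with_special_tokens` is a hypothetical helper, not part of the nmt project.

```python
SPECIAL_TOKENS = ["<unk>", "<s>", "</s>"]


def with_special_tokens(vocab):
    """Return vocab with the three special tokens first, without duplicates."""
    if vocab[:3] == SPECIAL_TOKENS:
        return list(vocab)
    rest = [tok for tok in vocab if tok not in SPECIAL_TOKENS]
    return SPECIAL_TOKENS + rest
```

The tokens would have to be added after the sed cleanup, since stripping punctuation would otherwise remove their angle brackets.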

In [57]:
%%bash
. ./env.sh
cd $HOME_DIR