# Formality Benchmarking

This is an implementation of the perplexity model used in <br>
`E. Pavlick and J. Tetreault. An empirical analysis of formality in online communication.` <br>
`Transactions of the Association for Computational Linguistics, 4:61–74, 2016.` <br>

### What is Perplexity?
Perplexity measures how well a model predicts a sample, in this case the sequence it is fed: the lower the perplexity, the less "surprised" the model is by that sequence. For formality transfer, we fit a language model to the Gigaword corpus (here just news articles, which are pretty formal and written by very competent writers) and then measure the perplexity of the style-transfer outputs under that model to see how close they are to coming from the same distribution.

In this case we are going to create a language model that represents the distribution of the words in the corpus. Then, when we do style transfer later, we can see how formal the outputs are by checking how well they match the distribution we are about to model.
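To make the definition concrete, here is a minimal sketch of perplexity computed from the probability a model assigned to each token of a sequence. The function name `perplexity` and the probability values are made up for illustration; they are not outputs of the model built in this notebook.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence given the probability the model gave each token."""
    # Average negative log-likelihood per token (natural log).
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiating gives the "effective number of choices" per token.
    return math.exp(avg_nll)

# Hypothetical per-token probabilities, invented for this example.
formal_like = [0.25, 0.5, 0.25, 0.5]      # model finds these tokens likely
informal_like = [0.05, 0.1, 0.02, 0.1]    # model finds these tokens unlikely

print(perplexity(formal_like))
print(perplexity(informal_like))
```

A sequence the model finds likely gets a lower perplexity, which is the signal used here as a proxy for "looks like formal news text".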

In [7]:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

In [8]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import regularizers

In [9]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.util import ngrams

This will be trained on a subset of Gigaword, specifically just the portion that comes from news articles. This version of the dataset is more accessible and much more affordable.

In [32]:
data = tfds.load("Gigaword", split=["train"], as_supervised=True)

In [37]:
temp = np.vstack(tfds.as_numpy(data[0]))

Let's clean the data. Pretty basic: just going to decode the entries (they are bytes right now, not strings) and remove stop words, nothing super fancy here.

In [41]:
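def clean_data(dirty_data):
    # Build the stop word set once; calling stopwords.words() for every word is very slow.
    stop_words = set(stopwords.words('english'))
    cleaned_rows = []
    for row in dirty_data:
        # The first column of each row is a bytes string: decode and lowercase it.
        text = row[0].decode("utf-8").lower()
        # Drop stop words and rejoin the remaining tokens.
        kept = [word for word in text.split(" ") if word not in stop_words]
        cleaned_rows.append(" ".join(kept))
    return cleaned_rows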

In [None]:
corpus = clean_data(temp)

Now I'm going to create some train and test sets. In the end, I'm thinking I'll sleep at night knowing the model is accurate for ~10000 examples, but maybe I'll change my mind some day. 

In [None]:
with open('cleaned-data.txt', 'w') as file:
    for sequence in corpus:
        # Write one cleaned sequence per line.
        file.write(sequence + "\n")

In [None]:
# Sample test indices without replacement so no example is drawn twice.
test_idx = np.random.choice(len(corpus), 10000, replace=False)
train_idx = list(set(range(len(corpus))) - set(test_idx))

train = [corpus[i] for i in train_idx]
test = [corpus[i] for i in test_idx]
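
`np.random.choice` samples with replacement unless `replace=False` is passed, so a test set drawn that way can contain repeated indices. Below is a minimal sketch of a duplicate-free split using only the standard library. `split_indices`, the seed, and the toy sizes are hypothetical and only for illustration.

```python
import random

def split_indices(n_examples, n_test, seed=0):
    """Return disjoint (train, test) index lists that together cover range(n_examples)."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so the test indices are unique.
    test = rng.sample(range(n_examples), n_test)
    held_out = set(test)
    train = [i for i in range(n_examples) if i not in held_out]
    return train, test

toy_train, toy_test = split_indices(n_examples=20, n_test=5)
print(len(toy_train), len(toy_test))  # 15 5
```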

In [None]:
tokenizer = Tokenizer(oov_token = '<OOV>')
tokenizer.fit_on_texts(train)

train_sequences = tokenizer.texts_to_sequences(train)
test_sequences = tokenizer.texts_to_sequences(test)

vocab_size = len(tokenizer.word_index)

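Now we need to split the data into $n$-grams: the first $n-1$ tokens of each $n$-gram are the model input and the last token is the prediction target.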

In [None]:
def split_data_with_ngrams(data, n):
    X, y = [], []
    for sequence in data:
        for ngram in zip(*[sequence[i:] for i in range(n)]):
            X.append(np.array(ngram[:-1]))
            y.append(ngram[-1])
        
    return np.array(X), np.array(y)
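
The `zip(*[sequence[i:] for i in range(n)])` idiom can be hard to read at first, so here is a small sketch of what it produces on a toy sequence. `sliding_ngrams` and the token ids are hypothetical and only for illustration.

```python
def sliding_ngrams(sequence, n):
    """All contiguous n-grams of a sequence, using the shifted-zip trick."""
    # sequence[i:] for i in 0..n-1 are copies of the sequence shifted by i positions;
    # zipping them lines up each token with its n-1 successors.
    return list(zip(*[sequence[i:] for i in range(n)]))

token_ids = [11, 12, 13, 14, 15]  # hypothetical token ids
trigrams = sliding_ngrams(token_ids, 3)
print(trigrams)   # [(11, 12, 13), (12, 13, 14), (13, 14, 15)]

# Split each n-gram into context and target, as split_data_with_ngrams does.
contexts = [gram[:-1] for gram in trigrams]
targets = [gram[-1] for gram in trigrams]
print(contexts)   # [(11, 12), (12, 13), (13, 14)]
print(targets)    # [13, 14, 15]
```

Because `zip` stops at the shortest input, a sequence with fewer than `n` tokens contributes no examples at all.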

In [None]:
X_train, y_train = split_data_with_ngrams(train_sequences, 3)
X_test, y_test = split_data_with_ngrams(test_sequences, 3)

y_train = tf.keras.utils.to_categorical(y_train, vocab_size+1)
y_test = tf.keras.utils.to_categorical(y_test, vocab_size+1)

### Now a model can be built!

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size+1, 50, input_length=2),
    tf.keras.layers.Dropout(0.2), 
    tf.keras.layers.LSTM(100, return_sequences=True),
    tf.keras.layers.LSTM(100),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(vocab_size+1, activation='softmax')
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

I want to know how long this is taking to train, so I'll print every 10 epochs (the run below only lasts 50). Unfortunately, I can't do this with the `verbose` argument; however, I can write my own callback to print. Not quite as pretty, but it'll get the job done.

In [None]:
class BetterVerboseCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # Report every 10th epoch (epochs are zero-indexed).
        if epoch % 10 == 0:
            print("Epoch: ", epoch)
            print("loss: ", logs.get("loss"))
            print("accuracy: ", logs.get("accuracy"))
            print("=" * 30)

In [None]:
model.fit(X_train, 
          y_train, 
          batch_size=256, 
          epochs=50, 
          verbose=0,
          callbacks=[BetterVerboseCallback()])

In [None]:
model.evaluate(X_test, y_test)
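
`model.evaluate` returns the loss followed by the compiled metrics, so here it reports cross-entropy and accuracy rather than perplexity. Since Keras computes categorical cross-entropy with the natural logarithm, exponentiating the mean loss gives the perplexity. A minimal sketch, where `perplexity_from_loss` and the example loss value are hypothetical and not results from this notebook:

```python
import math

def perplexity_from_loss(mean_cross_entropy):
    """Convert a mean cross-entropy (in nats) into perplexity."""
    return math.exp(mean_cross_entropy)

# In the notebook this would come from:
#   loss, accuracy = model.evaluate(X_test, y_test)
example_loss = math.log(50)  # hypothetical loss, chosen so the result is easy to check
print(perplexity_from_loss(example_loss))  # approximately 50
```

The same conversion applies to style-transfer outputs: run them through the tokenizer and n-gram split, evaluate, and exponentiate the loss.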