# Formality Benchmarking

## <b> WARNING </b>
This is a mess. It's mostly for me. If you're reading this, read on with caution.

This is an implementation of the perplexity model used in <br>
`E. Pavlick and J. Tetreault. An empirical analysis of formality in online communication.` <br>
`Transactions of the Association for Computational Linguistics, 4:61–74, 2016.` <br>

### What is Perplexity?
Perplexity measures how well a model predicts a sample, or in this case the sequence it is fed. What we will do here for formality transfer is fit a model to the Gigaword corpus (in this case just news articles, which are pretty formal and written by competent writers) and then measure the perplexity of new sequences under that model to see how close they are to being from the same distribution. Lower perplexity means a closer match.

In this case we are going to create a language model that represents the distribution of the words in the corpus. Then, when we do style transfer later, we will be able to see how formal the outputs are by seeing how well they match the distribution we are about to model.
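To make the definition concrete, here is a minimal sketch of the calculation: perplexity is the exponential of the average negative log-probability the model assigns to each token. The `perplexity` helper and the toy probabilities are assumptions for illustration only, not part of the notebook's pipeline.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence, given the probability the model gave each token."""
    n = len(token_probs)
    # Average negative log-probability per token (the cross-entropy in nats)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

# A model that is always torn between 4 equally likely words has perplexity 4
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that is more confident about the right words scores lower
confident = perplexity([0.9, 0.8, 0.95, 0.7])
print(uniform, confident)
```

A sequence that looks like the training corpus gets high per-token probabilities and therefore low perplexity, which is the signal used here as a proxy for formality.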

In [None]:
import pickle
import json

In [1]:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

In [2]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

The cells below do not need to be run every time; use the data from `cleaned-data.txt` instead.

---

### Data Loading

The model will be trained on a subset of Gigaword, specifically just the portion that comes from news articles. This version of the dataset is more accessible and much more affordable.

In [4]:
raise Exception("Make sure you want to run this!!!")

In [3]:
from nltk.corpus import stopwords

In [5]:
data = tfds.load("Gigaword", split=["train"], as_supervised=True)

In [None]:
temp = np.vstack(tfds.as_numpy(data[0]))

Let's clean the data. Pretty basic: just going to decode the strings (they are bytes right now) and remove stop words, nothing super fancy here.

In [None]:
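def clean_data(dirty_data):
    # Build the stop word set once; calling stopwords.words() for every word is very slow
    stop_words = set(stopwords.words('english'))
    cleaned_rows = []
    for row in dirty_data:
        # Each row holds one bytes string; decode and lowercase it
        text = row[0].decode("utf-8").lower()
        kept = [word for word in text.split(" ") if word not in stop_words]
        cleaned_rows.append(" ".join(kept))
    return cleaned_rows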

In [None]:
# corpus = clean_data(temp)

This is commented out because it takes forever to run 

In [None]:
# with open('cleaned-data.txt', 'w') as file:
#     for sequence in corpus:
#         file.write('%s\n' % sequence)

---

In [3]:
with open('Data/Formality-Cleaned.txt', 'r') as file:
    # Strip only the trailing newline, so a last line without one is not truncated
    corpus = [line.rstrip('\n') for line in file]

For benchmarking now, I'm going to train on a random sample of 150,000 of the roughly 4 million sequences. If this doesn't work, the next step is to make a better language model.

In [4]:
# Sample 150,000 distinct corpus indices (no replacement, so the splits cannot overlap)
chosen_data = np.random.choice(len(corpus), 150000, replace=False)

# The sample is already in random order, so slicing gives disjoint random splits
test_idx = chosen_data[:500]
val_idx = chosen_data[500:1000]
train_idx = chosen_data[1000:]

train = [corpus[i] for i in train_idx]
val = [corpus[i] for i in val_idx]
test = [corpus[i] for i in test_idx]

Now I'm going to tokenize the data for the RNN

In [5]:
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(train)

Here we convert our sequences using the tokenizer

In [6]:
train_sequences = tokenizer.texts_to_sequences(train)
val_sequences = tokenizer.texts_to_sequences(val)
test_sequences = tokenizer.texts_to_sequences(test)

vocab_size = len(tokenizer.word_index)

Now we need to split the data into $n$-grams. I'm starting off with 3-grams, since that's what Pavlick and Tetreault used, but there is a lot of data here, so maybe 5-grams will be explored later.
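As a quick illustration of what splitting into $n$-grams means, here is a minimal sketch on a made-up token sequence. The `toy_ngrams` helper and the token IDs are hypothetical; it uses the same sliding-window `zip` idea, just on one sequence.

```python
def toy_ngrams(sequence, n):
    """Return every run of n consecutive tokens in the sequence, in order."""
    # zip over n shifted copies of the sequence; zip stops at the shortest one
    return list(zip(*[sequence[i:] for i in range(n)]))

toy_sequence = [4, 8, 15, 16, 23]
toy_trigrams = toy_ngrams(toy_sequence, 3)
print(toy_trigrams)  # [(4, 8, 15), (8, 15, 16), (15, 16, 23)]
```

A sequence shorter than `n` produces no n-grams at all, which is one reason padding is often added at the sequence boundaries.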

---

### For non-neural ngram model

In [28]:
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

In [25]:
def split_data_with_ngrams(data, n):
    # Slide a window of length n over every sequence and collect the n-gram tuples
    ngrams = [ngram for sequence in data for ngram in zip(*[sequence[i:] for i in range(n)])]
    return ngrams

In [26]:
train_ngrams = split_data_with_ngrams(train_sequences, 3)
val_ngrams = split_data_with_ngrams(val_sequences, 3)
test_ngrams = split_data_with_ngrams(test_sequences, 3)

In [29]:
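# Use new names so the raw `train` text list is not overwritten
train_ngram_data, ngram_vocab = padded_everygram_pipeline(3, train_sequences)

In [7]:
model = MLE(3)
# fit() expects the padded everygrams and the vocabulary stream from the pipeline
model.fit(train_ngram_data, ngram_vocab)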
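Once an n-gram model is fit, scoring held-out text looks roughly like the sketch below. It runs on a made-up two-sentence corpus, and it swaps in `Laplace` smoothing instead of `MLE`, because an unsmoothed MLE model assigns zero probability to any unseen trigram and the perplexity becomes infinite. All `toy_*` names are hypothetical.

```python
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

# Tiny stand-in corpus; the real notebook would pass tokenized Gigaword sequences
toy_train = [["the", "market", "rose"], ["the", "market", "fell"]]
toy_test = ["the", "market", "rose"]

# Same pipeline as above: padded 1- to 3-grams plus a flat vocabulary stream
toy_grams, toy_vocab = padded_everygram_pipeline(3, toy_train)
lm = Laplace(3)  # add-one smoothing, an assumption made for this sketch
lm.fit(toy_grams, toy_vocab)

# Pad the test sentence the same way, then score its trigrams
test_trigrams = list(ngrams(pad_both_ends(toy_test, n=3), 3))
toy_perplexity = lm.perplexity(test_trigrams)
print(toy_perplexity)
```

Whether add-one smoothing is good enough for the real corpus is an open question; it is used here only to keep the toy score finite.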

---

### For LSTM N-gram Model

In [7]:
def split_data_ngrams(seqs, batch_size=256):
    # Expand each sequence into all of its prefixes of length >= 2
    seqs = [seq[:i+1] for seq in seqs for i in range(1, len(seq))]
    print("Number of Sequences: ", len(seqs))
    # The last token of each prefix is the target; everything before it is the input
    y = tf.constant([seq[-1] for seq in seqs])
    x = tf.ragged.constant([seq[:-1] for seq in seqs])
    data = tf.data.Dataset.from_tensor_slices((x, y))
    # One-hot encode targets per batch instead of building one huge dense label matrix
    return data.batch(batch_size).map(lambda x, y: (x, tf.one_hot(y, depth=vocab_size+1)))

In [8]:
train_data = split_data_ngrams(train_sequences)
val_data = split_data_ngrams(val_sequences)
test_data = split_data_ngrams(test_sequences)

Number of Sequences:  2692667
Number of Sequences:  9024
Number of Sequences:  9322
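The jump in sequence counts comes from expanding every sequence into all of its prefixes, each paired with the token that follows it. A minimal pure-Python sketch of that expansion on one made-up sequence (the `prefix_examples` helper is hypothetical and skips the TensorFlow batching):

```python
def prefix_examples(seq):
    """Turn one token sequence into (input prefix, next token) training pairs."""
    # Prefix seq[:i] is the model input; seq[i] is the token it should predict
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

toy_pairs = prefix_examples([5, 9, 2, 7])
for prefix, target in toy_pairs:
    print(prefix, "->", target)
```

A sequence of length `k` yields `k - 1` training pairs, so single-token sequences contribute nothing.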


### Now a model can be built!

First I'm going to load GloVe embeddings from <a href="https://nlp.stanford.edu/projects/glove/">Stanford</a>.

In [9]:
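embeddings_index = {}
# Read the GloVe file as UTF-8 explicitly so the platform default encoding does not matter
with open('Data/embeddings/glove.6B.50d.txt', encoding='utf-8') as f: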
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

embeddings_matrix = np.zeros((vocab_size+1, 50))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector

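* A single 2048-unit LSTM with a 100-unit Dense layer has high-bias problems (it underfits)
    * This included Dropout between the LSTM and Dense layers
* A double 2048-unit LSTM with a 1000-unit Dense layer overfits, and adding Dropout or L1/L2 regularization runs out of memory (OOM)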


In [10]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size+1, 50, weights=[embeddings_matrix], trainable=False),
    tf.keras.layers.Dropout(0.2), 
    tf.keras.layers.LSTM(1024, return_sequences=True),
    tf.keras.layers.LSTM(1024),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(500, activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(vocab_size+1, activation='softmax')
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [11]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 50)          3022100   
_________________________________________________________________
dropout (Dropout)            (None, None, 50)          0         
_________________________________________________________________
lstm (LSTM)                  (None, None, 1024)        4403200   
_________________________________________________________________
lstm_1 (LSTM)                (None, 1024)              8392704   
_________________________________________________________________
dropout_1 (Dropout)          (None, 1024)              0         
_________________________________________________________________
dense (Dense)                (None, 500)               512500    
_________________________________________________________________
flatten (Flatten)            (None, 500)               0

This is reporting accuracy, which is pretty useless for a language model, so ignore it if you are looking at this.

In [13]:
history = model.fit(train_data,
                    epochs=5,
                    validation_data=(val_data),
                    verbose=1)

Epoch 1/5
   10/10519 [..............................] - ETA: 24:04 - loss: 7.8806 - accuracy: 0.0211

KeyboardInterrupt: 

In [23]:
import matplotlib.pyplot as plt

In [None]:
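# exp(cross-entropy loss) is the perplexity, so this plots perplexity per epoch
plt.plot(np.exp(history.history['loss']), label='train')
plt.plot(np.exp(history.history['val_loss']), label='validation')
plt.legend()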