This is the notebook implementing basic LSTM networks for toxic classification with tensorflow. (refer to DL-ND sentiment_RNN)

In [5]:
import numpy as np
import pandas as pd
import tensorflow as tf

In [6]:
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

glove.6B.100d.txt.zip
glove6b100dtxt
sample_submission.csv
sample_submission.csv.zip
test.csv
test.csv.zip
train.csv
train.csv.zip



# Load data 

In [7]:
train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")

print(train_df.shape)

(159571, 8)


In [8]:
train_df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [9]:
min(map(len, train_df.comment_text))

6

# Data preprocessing

In [10]:
class_list = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
labels = train_df[class_list].values

In [11]:
labels.shape

(159571, 6)

First, let's remove all punctuation. Then get all the text without the newlines and split it into individual words.

In [12]:
from string import punctuation

In [13]:
def clean_string(sentence):
    text = "".join([c for c in sentence if c not in punctuation])
    sens = " ".join(text.split('\n'))
    words = sens.split(" ")
    
    return [word.lower() for word in words if word]

In [14]:
list_sentences_train = train_df["comment_text"].apply(clean_string).values

## Encoding words

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our reviews into integers so they can be passed into the network.

In [15]:
from collections import Counter

In [16]:
#from functools import reduce

#words = reduce(lambda x,y: x+y, list(list_sentences_train))

In [17]:
import pickle
#pickle.dump( words, open( "words.pkl", "wb" ) )
words = pickle.load(open("words.pkl","rb"))

In [18]:
counts = Counter(words)

In [19]:
vocab = sorted(counts, key=counts.get, reverse=True)

In [20]:
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

In [21]:
list_tokenized_train = [[vocab_to_int[word] for word in sentence] for sentence in list_sentences_train]

In [22]:
list_sentences_train

array([ ['explanation', 'why', 'the', 'edits', 'made', 'under', 'my', 'username', 'hardcore', 'metallica', 'fan', 'were', 'reverted', 'they', 'werent', 'vandalisms', 'just', 'closure', 'on', 'some', 'gas', 'after', 'i', 'voted', 'at', 'new', 'york', 'dolls', 'fac', 'and', 'please', 'dont', 'remove', 'the', 'template', 'from', 'the', 'talk', 'page', 'since', 'im', 'retired', 'now892053827'],
       ['daww', 'he', 'matches', 'this', 'background', 'colour', 'im', 'seemingly', 'stuck', 'with', 'thanks', 'talk', '2151', 'january', '11', '2016', 'utc'],
       ['hey', 'man', 'im', 'really', 'not', 'trying', 'to', 'edit', 'war', 'its', 'just', 'that', 'this', 'guy', 'is', 'constantly', 'removing', 'relevant', 'information', 'and', 'talking', 'to', 'me', 'through', 'edits', 'instead', 'of', 'my', 'talk', 'page', 'he', 'seems', 'to', 'care', 'more', 'about', 'the', 'formatting', 'than', 'the', 'actual', 'info'],
       ...,
       ['spitzer', 'umm', 'theres', 'no', 'actual', 'article', 'for', '

##  Feature for sentence

In [23]:
max_len = 300
features = np.zeros((len(list_tokenized_train), max_len), dtype=int)
for i, row in enumerate(list_tokenized_train):
    features[i, -len(row):] = np.array(row)[:max_len]

## Dataset split training, validation, test

In [24]:
split_frac = 0.8
split_idx = int(len(features)*0.8)
train_x, val_x = features[:split_idx], features[split_idx:]
train_y, val_y = labels[:split_idx], labels[split_idx:]

test_idx = int(len(val_x)*0.5)
val_x, test_x = val_x[:test_idx], val_x[test_idx:]
val_y, test_y = val_y[:test_idx], val_y[test_idx:]

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set: 		(127656, 300) 
Validation set: 	(15957, 300) 
Test set: 		(15958, 300)


# Build computation graph

Here, we'll build the graph. First up, defining the hyperparameters.

* `lstm_size`: Number of units in the hidden layers in the LSTM cells. Usually larger is better performance wise. Common values are 128, 256, 512, etc.
* `lstm_layers`: Number of LSTM layers in the network. I'd start with 1, then add more if I'm underfitting.
* `batch_size`: The number of reviews to feed the network in one training pass. Typically this should be set as high as you can go without running out of memory.
* `learning_rate`: Learning rate

In [53]:
lstm_size = 256
lstm_layers = 1
batch_size = 500
learning_rate = 0.00002

For the network itself, we'll be passing in our 200 element long review vectors. Each batch will be `batch_size` vectors. We'll also be using dropout on the LSTM layer, so we'll make a placeholder for the keep probability.

In [45]:
graph = tf.Graph()

with graph.as_default():
    
    inputs_ = tf.placeholder(tf.int32, [None, None], name="inputs")
    labels_ = tf.placeholder(tf.int32, [None, 6], name="labels")  
    keep_prob = tf.placeholder(tf.float32, name="dropout_keep")

Create the embedding lookup matrix as a `tf.Variable`. Use that embedding matrix to get the embedded vectors to pass to the LSTM cell with [`tf.nn.embedding_lookup`](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup).

In [46]:
embed_size = 300
n_words = len(vocab_to_int)
with graph.as_default():
    embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs_)

To create a basic LSTM cell for the graph, you'll want to use tf.contrib.rnn.BasicLSTMCell. Looking at the function documentation: tf.contrib.rnn.BasicLSTMCell(num_units, forget_bias=1.0, input_size=None, state_is_tuple=True, activation=<function tanh at 0x109f1ef28>)
you can see it takes a parameter called num_units, the number of units in the cell, called lstm_size in this code. So then, you can write something like 

* lstm = tf.contrib.rnn.BasicLSTMCell(num_units)
to create an LSTM cell with num_units.

Next, you can add dropout to the cell with tf.contrib.rnn.DropoutWrapper. This just wraps the cell in another cell, but with dropout added to the inputs and/or outputs. It's a really convenient way to make your network better with almost no effort! So you'd do something like
* drop = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)

Most of the time, you're network will have better performance with more layers. That's sort of the magic of deep learning, adding more layers allows the network to learn really complex relationships. Again, there is a simple way to create multiple layers of LSTM cells with tf.contrib.rnn.MultiRNNCell:
* cell = tf.contrib.rnn.MultiRNNCell([drop] * lstm_layers)

In [47]:
with graph.as_default():
    #lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    #dropout = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob = keep_prob)
    cell = tf.contrib.rnn.MultiRNNCell(
        [tf.contrib.rnn.DropoutWrapper(tf.contrib.rnn.BasicLSTMCell(lstm_size), 
                                                                      output_keep_prob = keep_prob) for _ in range(lstm_layers)])
    
    # Getting an initial state of all zeros
    initial_state = cell.zero_state(batch_size, tf.float32)

Use tf.nn.dynamic_rnn to add the forward pass through the RNN. Remember that we're actually passing in vectors from the embedding layer, embed.

In [48]:
with graph.as_default():
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed, initial_state = initial_state)


In [49]:
with graph.as_default():
    logits = tf.contrib.layers.fully_connected(outputs[:, -1], 6)
    cost = tf.losses.sigmoid_cross_entropy(labels_, logits)
    
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

validation accuracy

In [50]:
with graph.as_default():
    correct_pred = tf.equal(tf.cast(tf.argmax(logits), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

get bactch

In [51]:
def get_batches(x, y, batch_size=100):
    
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]

In [None]:
for x,y in get_batches(train_x, train_y, batch_size):
    print(x.shape,y.shape)
    break

# Training

In [None]:
epochs = 10

with graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)
        
        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y,
                    keep_prob: 0.5,
                    initial_state: state}
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)
            
            if iteration%5==0:
                print("Epoch: {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {}".format(loss))

            if iteration%25==0:
                val_acc = []
                val_state = sess.run(cell.zero_state(batch_size, tf.float32))
                for x, y in get_batches(val_x, val_y, batch_size):
                    feed = {inputs_: x,
                            labels_: y,
                            keep_prob: 1,
                            initial_state: val_state}
                    batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)
                print("Val acc: {:.3f}".format(np.mean(val_acc)))
            iteration +=1
    saver.save(sess, "checkpoints/basic_lstm_tf.ckpt")

Epoch: 0/10 Iteration: 5 Train loss: 0.7243062853813171
Epoch: 0/10 Iteration: 10 Train loss: 0.7158084511756897
Epoch: 0/10 Iteration: 15 Train loss: 0.7078432440757751
Epoch: 0/10 Iteration: 20 Train loss: 0.7034446597099304
Epoch: 0/10 Iteration: 25 Train loss: 0.7004823684692383
Val acc: 0.005
Epoch: 0/10 Iteration: 30 Train loss: 0.6967126727104187
Epoch: 0/10 Iteration: 35 Train loss: 0.6952726244926453
Epoch: 0/10 Iteration: 40 Train loss: 0.6944915652275085
Epoch: 0/10 Iteration: 45 Train loss: 0.6940205097198486
Epoch: 0/10 Iteration: 50 Train loss: 0.6935234665870667
Val acc: 0.389
Epoch: 0/10 Iteration: 55 Train loss: 0.6934249401092529
Epoch: 0/10 Iteration: 60 Train loss: 0.6932600736618042
Epoch: 0/10 Iteration: 65 Train loss: 0.693517804145813
Epoch: 0/10 Iteration: 70 Train loss: 0.6932862997055054
Epoch: 0/10 Iteration: 75 Train loss: 0.6931747794151306
Val acc: 0.706
Epoch: 0/10 Iteration: 80 Train loss: 0.6932008266448975
Epoch: 0/10 Iteration: 85 Train loss: 0.69321