## Practical 2:  Classifying TED Talks
<p>A solution to the task for the Deep Learning for NLP 2017 course.<br>
https://www.cs.ox.ac.uk/teaching/courses/2016-2017/dl/</p>
<p>Tasks created by [Yannis Assael, Brendan Shillingford, Chris Dyer]</p>

This 

In [1]:
import numpy as np
import os
from random import shuffle
import re

In [2]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

### Part 0: Downloading and preprocessing

In [3]:
import urllib.request
import zipfile
import lxml.etree

In [4]:
# Download the dataset if it's not already there: this may take a minute as it is 75MB
if not os.path.isfile('ted_en-20160408.zip'):
    urllib.request.urlretrieve("https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&filename=ted_en-20160408.zip", filename="ted_en-20160408.zip")

In [5]:
# For now, we're only interested in the subtitle text, so let's extract that from the XML:
with zipfile.ZipFile('ted_en-20160408.zip', 'r') as z:
    doc = lxml.etree.parse(z.open('ted_en-20160408.xml', 'r'))
all_text = '\n'.join(doc.xpath('//content/text()'))
talkcontent = doc.xpath('//content/text()')
keywords = doc.xpath('//head/keywords/text()')
talks = [(a,b) for (a,b) in zip(talkcontent, keywords)]

The XML is traversed, and for each talk we obtain the text of the talk, as well as associated keywords. 

In [6]:
talk, tag = talks[4]
print ( "[ " + talk[:500] + " ... ]" )
print( "Tagged as: " + tag)

print(len(talks) , "talks parsed.")

[ Thousands of years from now, we'll look back at the first century of computing as a fascinating but very peculiar time -- the only time in history where humans were reduced to live in 2D space, interacting with technology as if we were machines; a singular, 100-year period in the vastness of time where humans communicated, were entertained and managed their lives from behind a screen.
Today, we spend most of our time tapping and looking at screens. What happened to interacting with each other? I ... ]
Tagged as: talks, NASA, communication, computers, creativity, design, engineering, exploration, future, innovation, interface design, invention, microsoft, potential, prediction, product design, technology, visualizations
2085 talks parsed.


This then needs to be reformatted, so that only the relevant keywords are included, and the text content is tokenised in the same manner as the first practical

In [128]:
def to_label(keywords):
    labels = [x.strip() for x in keywords.split(',')]
    
    label1 = "o"
    label2 = "o"
    label3 = "o"
    
    if "technology" in labels:
        label1 = "T"
    if "entertainment" in labels:
        label2 = "E"
    if "design" in labels:
        label3 = "D"
        
    return label1 + label2 + label3

def to_one_hot(label):
    if label == "ooo":
        return [1,0,0,0,0,0,0,0]
    if label == "Too":
        return [0,1,0,0,0,0,0,0]
    if label == "o£o":
        return [0,0,1,0,0,0,0,0]
    if label == "ooD":
        return [0,0,0,1,0,0,0,0]
    if label == "TEo":
        return [0,0,0,0,1,0,0,0]
    if label == "oED":
        return [0,0,0,0,0,1,0,0]
    if label == "ToD":
        return [0,0,0,0,0,0,1,0]
    if label == "TED":
        return [0,0,0,0,0,0,0,1]

In [102]:
processed_talks = list()
labels = list()

for (talk, keywords) in talks:
    input_text_noparens = re.sub(r'\([^)]*\)', '', talk)

    sentences_strings_ted = []

    for line in input_text_noparens.split('\n'):
        m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
        sentences_strings_ted.extend(sent for sent in m.groupdict()['postcolon'].split('.') if sent)

    sentences_ted = []
        
    for sent_str in sentences_strings_ted:
        tokens = re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split()
        sentences_ted.extend(tokens)
    
    #There are a couple of talks without any actual parsable content - i.e. id 93. We will skip these
    if(len(sentences_ted) == 0):
        continue
        
    #Process keywords    
    label = to_label(keywords)
    
    processed_talks.append(sentences_ted)
    labels.append(label)


Now all talks are given as a list of tokens, with their keywords.

In [103]:
words = processed_talks[4]
label = labels[4]
print(words[0:50])
print(label)

['thousands', 'of', 'years', 'from', 'now', 'we', 'll', 'look', 'back', 'at', 'the', 'first', 'century', 'of', 'computing', 'as', 'a', 'fascinating', 'but', 'very', 'peculiar', 'time', 'the', 'only', 'time', 'in', 'history', 'where', 'humans', 'were', 'reduced', 'to', 'live', 'in', '2d', 'space', 'interacting', 'with', 'technology', 'as', 'if', 'we', 'were', 'machines', 'a', 'singular', '100', 'year', 'period', 'in']
ToD


In [104]:
for i in range(0,20):
    print (labels[i])

ooo
ooo
ooo
ooD
ToD
ooo
ooo
ooo
ooo
ToD
ooo
ooD
oEo
ToD
ooo
ooo
ooD
ooo
ToD
ooo


### Part 2: Analysis

Randomly permute the dataset, and keep the last two blocks of 250 for validation and testing

In [105]:
zipped = list(zip(processed_talks, labels))

shuffle(zipped)

shuffled_talks, shuffeled_labels = zip(*zipped)

data_training = shuffled_talks[:-500]
labels_training = shuffeled_labels[:-500]
print( str(len(data_training)) + " training items")  

data_testing = shuffled_talks[-500:-250]
labels_testing = shuffeled_labels[-500:-250]
print( str(len(data_testing)) + " testing items") 

data_validation = shuffled_talks[-250:]
labels_validation = shuffeled_labels[-250:]
print( str(len(data_validation)) + " validation items")

test_size = len(data_training)
validation_size = len(data_validation)

1578 training items
250 testing items
250 validation items


1. Compare the learning curves of the model starting from random embeddings, starting from GloVe embeddings (http://nlp.stanford.edu/data/glove.6B.zip; 50 dimensions) or fixed to be the GloVe values. Training in batches is more stable (e.g. 50), which model works best on training vs. test? Which model works best on held-out accuracy?
2. What happens if you try alternative non-linearities (logistic sigmoid or ReLU instead of tanh)?
3. What happens if you add dropout to the network?
4. What happens if you vary the size of the hidden layer?
5. How would the code change if you wanted to add a second hidden layer?
6. How does the training algorithm affect the quality of the model?
7. Project the embeddings of the labels onto 2 dimensions and visualise (each row of the projection matrix V corresponds a label embedding). Do you see anything interesting?

In [106]:
# Download the glove embeddings
if not os.path.isfile('glove.zip'):
    urllib.request.urlretrieve("http://nlp.stanford.edu/data/glove.6B.zip", filename="glove.zip")

In [107]:
z = zipfile.ZipFile('glove.zip', 'r') 
glove_lines = [ line.decode("utf-8").strip() for line in z.open('glove.6B.50d.txt', 'r').readlines() ] 
z.extract('glove.6B.50d.txt')

'F:\\deeplearning\\glove.6B.50d.txt'

In [108]:
print(glove_lines[0])

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581


In [109]:
import tensorflow as tf

In [110]:
glove_embedding = { }

for line in glove_lines:
    words = line.split()
    word = words[0]
    embedding = [float(x) for x in words[1:] ]#map(float, words[1:])
    glove_embedding[word] = embedding

In [111]:
embedding_size = 50
vocabulary_size = len(glove_lines)

def print_embedding(word, embedding) :
    print (word + " embedding: " "[ " + " ".join(map(str, embedding)) + "]")

print( "Glove embedding: ")
print_embedding("the", glove_embedding["the"])

print("Vocabulary Size: ", vocabulary_size)


Glove embedding: 
the embedding: [ 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581]
Vocabulary Size:  400000


We need to map words which arent in our embedding to a special token, which we may define as '~'

We also need a method to prepare the word embedding vector

In [112]:
def get_token(word, embedding_vec) :
    if (word in embedding_vec):
        return word
    else:
        return '~'
    
#Use bag of words structure - average over all word emebddings
def get_embedding(text, embedding):
        
    if (len(text) == 0):
        print("oop")
        
    mean = np.array(np.zeros(embedding_size))
    length = 0
    
    for word in text:
        length += 1   
        mean += embedding[get_token(word, embedding)]
                   
    mean /= length
    
    return mean
    

In [135]:
#https://www.tensorflow.org/tutorials/word2vec 

#We need to find the embedding size, this can be either retrieved from the glove dataset or by taking our embedding size as
#the size of our largest sentence

#practical 1 learnt embeddings by using the Word2Vec(sentences, size=100, min_count=10)
#Not sure if this should still be used

embedding = get_embedding(processed_talks[0][0], glove_embedding)

print("Words: " + str(processed_talks[0][0][:10]) + "....")
print("Embedding for text: " + str(embedding))

embed_training = [get_embedding(words, glove_embedding) for words in data_training]
embed_testing = [get_embedding(words, glove_embedding) for words in data_testing]
embed_validation = [get_embedding(words, glove_embedding) for words in data_validation]

print(embed_training[0])

Words: here....
Embedding for text: [ 0.20422     0.69111     1.0191175   0.9112025  -0.1018875  -0.2443
  0.30575    -0.762815    0.01082     0.0658945  -0.607105    0.464846
 -0.4133975  -0.73253    -0.2932875  -0.585892   -0.5807895  -0.3730975
  0.0764475  -0.38028    -0.249845   -0.44856     0.9568675   0.8807275
 -0.2546175  -0.69559325 -0.23883875 -0.34764425 -0.06621    -0.434351
  2.355375   -0.29038925 -0.1535075   0.17053    -0.075032   -0.99037
  0.344277   -0.4631875   0.308017    0.26623825  1.37863     0.2944975
 -0.2656045   0.0956935   0.0872225   0.371975    0.21221125 -0.366875
  0.2746625   1.48315   ]
[  3.40829947e-01   1.94150592e-01  -8.39574626e-02  -1.19638391e-01
   3.50347154e-01   1.54856151e-01  -3.01097553e-01  -1.89412729e-01
  -7.95121311e-02   2.78766527e-03   7.24697912e-02   8.34275427e-02
  -2.26032122e-01  -1.08499825e-01   2.87796247e-01   1.87911087e-01
   1.28870968e-01   2.41990829e-02  -2.39164143e-01  -3.54875595e-01
   6.72452124e-02   1.246

In [114]:
import math
num_classes = 8
hidden_size = 32

batch_size = 50

### Network Structure
x = embedding(text)

h = tanh(Wx + b)

u = Vh + c

p = softmax(u)

prediction = argmax(p)

In [137]:
#Function to return (embedding, loss func, optimizer , ... )

#1.random embeddings
#2. start from GloVe
#3. fixed as GloVe

#train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
#train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

#embedding_placeholder = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

#x = tf.placeholder("float", shape=[None, embedding_size])

x = tf.placeholder("float", shape=[1, embedding_size])
y = tf.placeholder("float", shape=[1, num_classes])

#embed = tf.nn.embedding_lookup(embedding_placeholder, x)

W_weights = tf.Variable(
  tf.truncated_normal([embedding_size, hidden_size],
                      stddev=1.0 / math.sqrt(embedding_size)))

V_weights = tf.Variable(
  tf.truncated_normal([hidden_size, num_classes],
                      stddev=1.0 / math.sqrt(embedding_size)))

b = tf.Variable(tf.zeros([hidden_size]))
c = tf.Variable(tf.zeros([num_classes]))

# h = tanh(Wx + b)
h = tf.tanh(tf.add(
        tf.matmul(x, W_weights), b))
            
#Here, we can add a number of more hidden layers, i.e. a h' between h and u, which is built
#similarily to h - activation ( hidden_weights * prev_layer + hidden_biases )

#u = Vh + c
u = tf.add(tf.matmul(h, V_weights), c)
            
p = tf.nn.softmax(u)
#predict = tf.argmax(p, axis = 0)

print("X : ", x)
print("W : ", W_weights)
print("h : ", h)
print("V : ", V_weights)
print("u : ", u)
print("softmax : ", p)
#print("argmax: ", predict)

#loss = tf.reduce_mean(
#  tf.nn.nce_loss(nce_weights, nce_biases, embed, train_labels,
#                 num_sampled, vocabulary_size))
#loss = tf.nn.softmax_cross_entropy_with_logits(predict, train_labels)

X :  Tensor("Placeholder_20:0", shape=(1, 50), dtype=float32)
W :  Tensor("Variable_76/read:0", shape=(50, 32), dtype=float32)
h :  Tensor("Tanh_6:0", shape=(1, 32), dtype=float32)
V :  Tensor("Variable_77/read:0", shape=(32, 8), dtype=float32)
u :  Tensor("Add_13:0", shape=(1, 8), dtype=float32)
softmax :  Tensor("Softmax_1:0", shape=(1, 8), dtype=float32)


In [None]:
#Adjust non-linearirities

#Adjust dropout

#Adjust hidden layer size

In [117]:
#We should use Adam as optimiser
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=u, labels=y))
optimizer = tf.train.AdamOptimizer().minimize(cost)

In [131]:

def train(graph, embedding, embedding_name, optimizer, loss, dropout = False, embedding_trainable = True):
    init = tf.global_variables_initializer()
    
    with tf.Session() as session:
        init.run()
        print("Initialised training.")
        
        #_embedding = tf.get_variable(name=embedding_name, shape=embedding.shape, 
        #        initializer=tf.constant_initializer(embedding), trainable=embedding_trainable)
        
        num_batches = (int) (test_size / batch_size)
        
        for batch_index in range(0, num_batches):
            start_index = batch_size * batch_index
            end_index = (batch_size + 1) * batch_index - 1

            text_train = embed_training[start_index : end_index]
            labels_train = to_one_hot(labels_training[start_index : end_index])
            
            text_test = embed_testing[start_index : end_index]
            labels_test = to_one_hot(labels_testing[start_index : end_index])
            
            session.run([optimizer, loss], 
                feed_dict={x: text_train, y: labels_train})
            
            train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                session.run(predict, feed_dict={x: text_train, y: labels_train}))
            
            test_accuracy  = np.mean(np.argmax(test_y, axis=1) ==
                session.run(predict, feed_dict={x: text_test, y: labels_test}))
            
            print("train accuracy = %.2f%%, test accuracy = %.2f%%"
              % (100. * train_accuracy, 100. * test_accuracy))
            
        session.close()
        
        

## Training

Investigating training random embeddings on the vocabulary:

In [124]:
glove_embeddings_fixed = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0), trainable = False)

glove_embeddings_trainable = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0), trainable = True)

embed_random = tf.nn.embedding_lookup(random_embedding, train_inputs)

#loss_random = tf.reduce_mean(
#      tf.nn.nce_loss(nce_weights, nce_biases, embed_random, train_labels,
#                 num_sampled, vocabulary_size))

#loss_logits = tf.reduce_mean(
#    tf.nn.softmax_cross_entropy_with_logits(
#        labels=train_labels, logits=yhat ) )

In [134]:
graph = tf.Graph()

random_embedding = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0), trainable = True)

train(graph, random_embedding, 'random_embedding', optimizer, cost, embedding_trainable = True)

Initialised training.


ValueError: Cannot feed value of shape (1577, 50) for Tensor 'Placeholder_16:0', which has shape '(1, 50)'

## Visualisation

In [None]:
#tSNE visualisation