## Practical 2: Classifying TED Talks
<p>A solution to the task for the Deep Learning for NLP 2017 course.<br>
https://www.cs.ox.ac.uk/teaching/courses/2016-2017/dl/</p>
<p>Tasks created by [Yannis Assael, Brendan Shillingford, Chris Dyer]</p>


In [68]:
import numpy as np
import os
from random import shuffle
import re

In [69]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

### Part 0: Downloading and preprocessing

In [70]:
import urllib.request
import zipfile
import lxml.etree

In [71]:
# Download the dataset if it's not already there: this may take a minute as it is 75MB
if not os.path.isfile('ted_en-20160408.zip'):
    urllib.request.urlretrieve("https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&filename=ted_en-20160408.zip", filename="ted_en-20160408.zip")

In [72]:
# For now, we're only interested in the subtitle text, so let's extract that from the XML:
with zipfile.ZipFile('ted_en-20160408.zip', 'r') as z:
    doc = lxml.etree.parse(z.open('ted_en-20160408.xml', 'r'))
all_text = '\n'.join(doc.xpath('//content/text()'))
# One transcript string and one comma-separated keyword string per talk
talkcontent = doc.xpath('//content/text()')
keywords = doc.xpath('//head/keywords/text()')
talks = list(zip(talkcontent, keywords))

The XML is traversed, and for each talk we obtain the text of the talk, as well as associated keywords. 
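
Zipping two separate XPath results only lines up if every talk has both fields. A minimal sketch of pairing them per talk instead is given here, using the standard-library `xml.etree.ElementTree` on a tiny inline sample so that it runs on its own. The `<file>` wrapper element and the sample contents are assumptions for illustration, not taken from the dataset.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the TED XML; the <file> wrapper element is an assumption.
SAMPLE = """<xml>
  <file><head><keywords>talks, design</keywords></head><content>First talk.</content></file>
  <file><head></head><content>Second talk, no keywords.</content></file>
</xml>"""

def pair_talks(root):
    """Return one (content, keywords) tuple per <file>, with '' for a missing field."""
    pairs = []
    for talk in root.iter('file'):
        content = talk.findtext('content', default='')
        keywords = talk.findtext('head/keywords', default='')
        pairs.append((content, keywords))
    return pairs

pairs = pair_talks(ET.fromstring(SAMPLE))
print(pairs)
```

Walking the tree talk by talk keeps each transcript attached to its own keywords even when one of the two is absent.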

In [83]:
talk, tag = talks[4]
print ( "[ " + talk[:500] + " ... ]" )
print( "Tagged as: " + tag)

print(len(talks) , "talks parsed.")

[ Thousands of years from now, we'll look back at the first century of computing as a fascinating but very peculiar time -- the only time in history where humans were reduced to live in 2D space, interacting with technology as if we were machines; a singular, 100-year period in the vastness of time where humans communicated, were entertained and managed their lives from behind a screen.
Today, we spend most of our time tapping and looking at screens. What happened to interacting with each other? I ... ]
Tagged as: talks, NASA, communication, computers, creativity, design, engineering, exploration, future, innovation, interface design, invention, microsoft, potential, prediction, product design, technology, visualizations
2085 talks parsed.


The data then needs to be reformatted so that only the relevant keywords (technology, entertainment and design) are kept, and the text is tokenised in the same manner as in the first practical.

In [80]:
def to_label(keywords):
    labels = [x.strip() for x in keywords.split(',')]
    
    label1 = "o"
    label2 = "o"
    label3 = "o"
    
    if "technology" in labels:
        label1 = "T"
    if "entertainment" in labels:
        label2 = "E"
    if "design" in labels:
        label3 = "D"
        
    return label1 + label2 + label3
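
A classifier needs integer class ids rather than label strings. The sketch below enumerates the eight possible three-character labels and assigns each an index. The names `ALL_LABELS` and `LABEL_TO_ID` are hypothetical, and the ordering is one arbitrary choice among many.

```python
from itertools import product

# Every combination of (o|T), (o|E), (o|D), in the same character order
# that to_label uses for its output string.
ALL_LABELS = [''.join(p) for p in product('oT', 'oE', 'oD')]

# Map each label string to a class id in 0..7.
LABEL_TO_ID = {label: i for i, label in enumerate(ALL_LABELS)}

print(ALL_LABELS)
print(LABEL_TO_ID['ToD'])
```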

In [81]:
processed_talks = list()

for (talk, keywords) in talks:
    input_text_noparens = re.sub(r'\([^)]*\)', '', talk)

    sentences_strings_ted = []

    for line in input_text_noparens.split('\n'):
        m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
        sentences_strings_ted.extend(sent for sent in m.groupdict()['postcolon'].split('.') if sent)

    # Tokenise each sentence: lowercase, then split on runs of non-alphanumerics
    sentences_ted = []
    for sent_str in sentences_strings_ted:
        tokens = re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split()
        sentences_ted.append(tokens)
        
    #Process keywords    
    label = to_label(keywords)
        
    processed_talks.append( (sentences_ted, label) )

Each talk is now a list of sentences, each sentence a list of tokens, paired with its three-character label.

In [82]:
sentences, tag = processed_talks[4]
print(sentences[0:2])
print(tag)

[['thousands', 'of', 'years', 'from', 'now', 'we', 'll', 'look', 'back', 'at', 'the', 'first', 'century', 'of', 'computing', 'as', 'a', 'fascinating', 'but', 'very', 'peculiar', 'time', 'the', 'only', 'time', 'in', 'history', 'where', 'humans', 'were', 'reduced', 'to', 'live', 'in', '2d', 'space', 'interacting', 'with', 'technology', 'as', 'if', 'we', 'were', 'machines', 'a', 'singular', '100', 'year', 'period', 'in', 'the', 'vastness', 'of', 'time', 'where', 'humans', 'communicated', 'were', 'entertained', 'and', 'managed', 'their', 'lives', 'from', 'behind', 'a', 'screen'], ['today', 'we', 'spend', 'most', 'of', 'our', 'time', 'tapping', 'and', 'looking', 'at', 'screens']]
ToD
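
The preprocessing steps can be tried in isolation on a single line of text. The sketch below wraps the same regular expressions in a hypothetical `tokenise` helper. The example sentence is made up, and unlike the loop above the helper drops sentences that end up with no tokens.

```python
import re

def tokenise(talk_text):
    """Split raw talk text into sentences, each a list of lowercase tokens."""
    # Drop parenthesised annotations such as "(Laughter)".
    no_parens = re.sub(r'\([^)]*\)', '', talk_text)
    sentences = []
    for line in no_parens.split('\n'):
        # Strip a short "Speaker:" prefix (at most 20 characters before the colon).
        m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
        for sent in m.group('postcolon').split('.'):
            tokens = re.sub(r'[^a-z0-9]+', ' ', sent.lower()).split()
            if tokens:  # skip sentences with nothing left after cleaning
                sentences.append(tokens)
    return sentences

example = tokenise("Chris Anderson: Hello there. (Laughter) We're here today.")
print(example)
```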


In [50]:
for i in range(0,20):
    print ((processed_talks[i])[1])

ooo
ooo
ooo
ooD
ToD
ooo
ooo
ooo
ooo
ToD
ooo
ooD
oEo
ToD
ooo
ooo
ooD
ooo
ToD
ooo
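
The labels look heavily skewed towards `ooo`, which matters when judging accuracy later. A small sketch for counting the label distribution follows, run here on the 20 labels printed above. The helper name `label_distribution` is hypothetical.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: (count, fraction)} ordered from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}

# The first 20 labels as printed above.
first_20 = ['ooo', 'ooo', 'ooo', 'ooD', 'ToD', 'ooo', 'ooo', 'ooo', 'ooo', 'ToD',
            'ooo', 'ooD', 'oEo', 'ToD', 'ooo', 'ooo', 'ooD', 'ooo', 'ToD', 'ooo']

dist = label_distribution(first_20)
for label, (n, frac) in dist.items():
    print(label, n, round(frac, 2))
```

On the full dataset the same call would be `label_distribution(label for _, label in processed_talks)`.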


### Part 2: Analysis

Randomly permute the dataset, and keep the last two blocks of 250 talks for testing and validation respectively.

In [51]:
shuffle(processed_talks)

data_training = processed_talks[:-500]
print( str(len(data_training)) + " training items")  
data_testing = processed_talks[-500:-250]
print( str(len(data_testing)) + " testing items") 
data_validation = processed_talks[-250:]
print( str(len(data_validation)) + " validation items")


1585 training items
250 testing items
250 validation items
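
Because `shuffle` is unseeded, each run produces a different split, which makes learning curves hard to compare across runs. A minimal sketch of a reproducible split is given below. The function name `split_dataset` and the seed value are assumptions for illustration.

```python
import random

def split_dataset(items, holdout=250, seed=0):
    """Shuffle a copy of items with a fixed seed and split off two holdout blocks."""
    rng = random.Random(seed)        # private RNG, leaves the global state alone
    shuffled = list(items)
    rng.shuffle(shuffled)
    train = shuffled[:-2 * holdout]
    test = shuffled[-2 * holdout:-holdout]
    validation = shuffled[-holdout:]
    return train, test, validation

train, test, validation = split_dataset(range(2085))
print(len(train), len(test), len(validation))
```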


1. Compare the learning curves of the model starting from random embeddings, starting from GloVe embeddings (http://nlp.stanford.edu/data/glove.6B.zip; 50 dimensions), or with embeddings fixed to the GloVe values. Training in batches (e.g. of 50) is more stable. Which model works best on training vs. test? Which model works best on held-out accuracy?
2. What happens if you try alternative non-linearities (logistic sigmoid or ReLU instead of tanh)?
3. What happens if you add dropout to the network?
4. What happens if you vary the size of the hidden layer?
5. How would the code change if you wanted to add a second hidden layer?
6. How does the training algorithm affect the quality of the model?
7. Project the embeddings of the labels onto 2 dimensions and visualise (each row of the projection matrix V corresponds to a label embedding). Do you see anything interesting?
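
The questions above refer to a model with one tanh hidden layer and a projection matrix V whose rows are label embeddings. A NumPy sketch of such a forward pass is shown below. It assumes the talk is represented by the mean of its word embeddings, and the hidden size of 32 and the random weights are placeholders, not values from the practical.

```python
import numpy as np

def forward(token_vectors, W, b, V, c):
    """Mean of word vectors -> tanh hidden layer -> softmax over the labels."""
    x = token_vectors.mean(axis=0)       # bag-of-means text representation
    h = np.tanh(W @ x + b)               # hidden layer
    u = V @ h + c                        # one score per label (rows of V)
    u = u - u.max()                      # shift for numerical stability
    return np.exp(u) / np.exp(u).sum()   # softmax probabilities

rng = np.random.default_rng(0)
embedding_size, hidden_size, n_labels = 50, 32, 8   # hidden_size is an assumption
W = rng.normal(scale=0.1, size=(hidden_size, embedding_size))
b = np.zeros(hidden_size)
V = rng.normal(scale=0.1, size=(n_labels, hidden_size))
c = np.zeros(n_labels)

# Twelve random vectors stand in for the embedded tokens of one talk.
probs = forward(rng.normal(size=(12, embedding_size)), W, b, V, c)
print(probs.shape, probs.sum())
```

Swapping `np.tanh` for a sigmoid or ReLU, or inserting a second `W2 @ h + b2` step, are the changes that questions 2 and 5 ask about.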

In [84]:
# Download the glove embeddings
if not os.path.isfile('glove.zip'):
    urllib.request.urlretrieve("http://nlp.stanford.edu/data/glove.6B.zip", filename="glove.zip")

In [118]:
z = zipfile.ZipFile('glove.zip', 'r') 
glove_lines = [ line.decode("utf-8").strip() for line in z.open('glove.6B.50d.txt', 'r').readlines() ] 
z.extract('glove.6B.50d.txt')

'/home/deepnlp2017/pracs/practical-1/glove.6B.50d.txt'

In [109]:
print(glove_lines[0])

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581
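
Each GloVe line is a word followed by its vector components as text. A short sketch of turning one line into a word and a list of floats follows. The helper name `parse_glove_line` is hypothetical and the input is shortened to three components.

```python
def parse_glove_line(line):
    """Split a GloVe text line into (word, list of float components)."""
    parts = line.rstrip().split(' ')
    return parts[0], [float(value) for value in parts[1:]]

word, vector = parse_glove_line("the 0.418 0.24968 -0.41242")
print(word, vector)
```

Converting to floats at load time avoids carrying strings into the numerical code later on.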


In [127]:
embedding_size = 50

def print_embedding(word, embedding) :
    print (word + " embedding: " "[ " + " ".join(embedding) + "]")

glove_embedding = { }

for line in glove_lines:
    words = line.split()
    word = words[0]
    embedding = words[1:]
    glove_embedding[word] = embedding

# Random baseline: one vector per GloVe word, drawn uniformly from [-1, 1)
random_embedding = { word: np.random.uniform(-1.0, 1.0, embedding_size) for word in glove_embedding }

print( "Glove embedding: ")
print_embedding("the", glove_embedding["the"])
# The random values differ on every run, so only the vector length is printed
print( "Random embedding size:", len(random_embedding["the"]))

 Glove embedding: 
the embedding: [ 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581]
 Random embedding size: 50

We need to map words that aren't in our embedding to a special token, which we define as '~'.

We also need a helper that returns the chosen word embedding.

In [157]:
def get_token(word, embedding_vec) :
    if (word in embedding_vec):
        return word
    else:
        return '~'
    
def get_embedding(use_glove = True):
    if use_glove:
        return glove_embedding
    else:
        return random_embedding
    
def tokenise_talk(talk, embedding_vec) :
    # `talk` is a list of sentences; only its first sentence is mapped here
    return [get_token(x, embedding_vec) for x in talk[0]]

In [158]:
#https://www.tensorflow.org/tutorials/word2vec 

# We need to choose the embedding size: either take it from the GloVe dataset,
# or set it to the size of our largest sentence

# Practical 1 learnt embeddings with Word2Vec(sentences, size=100, min_count=10);
# not sure whether that should still be used here

tokenised = tokenise_talk(processed_talks[0][0], glove_embedding)

print(tokenised)

['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new']
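
An embedding lookup works on integer ids, so the tokens (including the '~' fallback) still have to be mapped to indices. A minimal sketch is given below. The helpers `build_vocab` and `to_ids` are hypothetical, and reserving index 0 for '~' is an arbitrary choice.

```python
UNK = '~'

def build_vocab(words):
    """Assign each distinct word an id, reserving 0 for the unknown token."""
    vocab = {UNK: 0}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def to_ids(tokens, vocab):
    """Map tokens to ids, sending out-of-vocabulary tokens to the UNK id."""
    return [vocab.get(token, vocab[UNK]) for token in tokens]

vocab = build_vocab(['here', 'are', 'two', 'reasons'])
ids = to_ids(['here', 'are', 'two', 'zzyzx'], vocab)
print(ids)
```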


In [160]:
import tensorflow as tf
import math
vocabulary_size = 4
batch_size = 50
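
Training in batches of 50 means slicing the training set into fixed-size chunks on every epoch. A minimal generator sketch follows. The name `batches` and the `drop_last` option are assumptions, the latter being useful when a fixed-shape input cannot accept a shorter final batch.

```python
def batches(items, batch_size=50, drop_last=False):
    """Yield consecutive slices of items, each of at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        if drop_last and len(batch) < batch_size:
            return  # skip the incomplete final batch
        yield batch

sizes = [len(b) for b in batches(list(range(120)), 50)]
full_only = [len(b) for b in batches(list(range(120)), 50, drop_last=True)]
print(sizes, full_only)
```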

In [161]:
#Function to return (embedding, loss func, optimizer , ... )
#1.random embeddings
random_embedding = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

#2. start from GloVe

#3. fixed as GloVe

nce_weights = tf.Variable(
  tf.truncated_normal([vocabulary_size, embedding_size],
                      stddev=1.0 / math.sqrt(embedding_size)))

nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

In [41]:

train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
# Look up one embedding row per input id in the random embedding matrix
embed = tf.nn.embedding_lookup(random_embedding, train_inputs)

In [None]:
#Adjust non-linearities

#Adjust dropout

#Adjust hidden layer size

In [42]:
# What is our loss function? Hint: not this one
num_sampled = 2  # placeholder value (an assumption); must not exceed vocabulary_size
loss = tf.reduce_mean(
  tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                 labels=train_labels, inputs=embed,
                 num_sampled=num_sampled, num_classes=vocabulary_size))
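
Since each talk receives exactly one of eight labels, a softmax cross-entropy is a common choice of loss for this kind of classifier. The plain-Python sketch below shows the computation for a single example. The function names and the example scores are illustrative, not part of the practical's code.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target):
    """Negative log-probability that the softmax assigns to the target class."""
    return -math.log(softmax(scores)[target])

# With eight equal scores the loss is log(8), the chance-level baseline.
uniform_loss = cross_entropy([0.0] * 8, target=3)
confident_loss = cross_entropy([10.0] + [0.0] * 7, target=0)
print(uniform_loss, confident_loss)
```

A loss near `log(8)` during training therefore indicates a model that has not yet learnt anything beyond guessing.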

In [44]:
#We should use Adam as optimiser
optimizer = tf.train.AdamOptimizer()

In [None]:
#Code to add second hidden layer

## Training

## Visualisation

In [None]:
#tSNE visualisation
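
With only eight label embeddings, a simple linear projection may already be informative. The NumPy sketch below uses PCA computed via SVD rather than t-SNE, and runs on a random 8 x 32 matrix standing in for the trained projection matrix V. The helper name `project_2d` and the matrix size are assumptions.

```python
import numpy as np

def project_2d(X):
    """Project the rows of X onto their first two principal components."""
    centred = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T

# Random stand-in for V: one row per label, one column per hidden unit.
V = np.random.default_rng(0).normal(size=(8, 32))
coords = project_2d(V)
print(coords.shape)
```

The two columns of `coords` can be passed as x and y values to a bokeh scatter plot, with the label strings attached through a `LabelSet`.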