## Practical 1: word2vec
<p>Oxford CS - Deep NLP 2017<br>
https://www.cs.ox.ac.uk/teaching/courses/2016-2017/dl/</p>
<p>[Yannis Assael, Brendan Shillingford, Chris Dyer]</p>

This practical is presented as an IPython Notebook, with the code written for recent versions of **Python 3**. The code in this practical will not work with Python 2 unless you modify it. If you are using your own Python installation, ensure you have a setup identical to that described in the installation shell script (which is intended for use with the department lab machines). We will be unable to support installation on personal machines due to time constraints, so please use the lab machines and the setup script if you are unfamiliar with how to install Anaconda.

To execute a notebook cell, press `shift-enter`. The return value of the last command will be displayed, if it is not `None`.

Potentially useful library documentation, references, and resources:

* IPython notebooks: <https://ipython.org/ipython-doc/3/notebook/notebook.html#introduction>
* Numpy numerical array library: <https://docs.scipy.org/doc/>
* Gensim's word2vec: <https://radimrehurek.com/gensim/models/word2vec.html>
* Bokeh interactive plots: <http://bokeh.pydata.org/en/latest/> (we provide plotting code here, but click the thumbnails for more examples to copy-paste)
* scikit-learn ML library (aka `sklearn`): <http://scikit-learn.org/stable/documentation.html>
* nltk NLP toolkit: <http://www.nltk.org/>
* tutorial for processing xml in python using `lxml`: <http://lxml.de/tutorial.html> (we did this for you below, but in case you need it in the future)

In [2]:
import numpy as np
import os
from random import shuffle
import re

In [3]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

### Part 0: Download the TED dataset

In [4]:
import urllib
import zipfile
import lxml.etree

In [5]:
# Download the dataset if it's not already there: this may take a minute as it is 75MB
if not os.path.isfile('ted_en-20160408.zip'):
    urllib.urlretrieve("https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&filename=ted_en-20160408.zip", filename="ted_en-20160408.zip")

In [6]:
# For now, we're only interested in the subtitle text, so let's extract that from the XML:
with zipfile.ZipFile('ted_en-20160408.zip', 'r') as z:
    doc = lxml.etree.parse(z.open('ted_en-20160408.xml', 'r'))
input_text = '\n'.join(doc.xpath('//content/text()'))
del doc

### Part 1: Preprocessing

In this part, we attempt to clean up the raw subtitles a bit, so that we get only sentences. The following substring shows examples of what we're trying to get rid of. Since it's hard to define precisely what we want to get rid of, we'll just use some simple heuristics.

In [7]:
i = input_text.find("Hyowon Gweon: See this?")
input_text[i-20:i+150]

u' baby does.\n(Video) Hyowon Gweon: See this? (Ball squeaks) Did you see that? (Ball squeaks) Cool. See this one? (Ball squeaks) Wow.\nLaura Schulz: Told you. (Laughs)\n(Vide'

Let's start by removing all parenthesized strings using a regex:

In [8]:
input_text_noparens = re.sub(r'\([^)]*\)', '', input_text)

We can verify the same location in the text is now clean as follows. We won't worry about the irregular spaces since we'll later split the text into sentences and tokenize it anyway.

In [9]:
i = input_text_noparens.find("Hyowon Gweon: See this?")
input_text_noparens[i-20:i+150]

u"hat the baby does.\n Hyowon Gweon: See this?  Did you see that?  Cool. See this one?  Wow.\nLaura Schulz: Told you. \n HG: See this one?  Hey Clara, this one's for you. You "

Now, let's attempt to remove speakers' names that occur at the beginning of a line, by deleting pieces of the form "`<up to 20 characters>:`", as shown in this example. Of course, this is an imperfect heuristic. 

In [10]:
sentences_strings_ted = []
for line in input_text_noparens.split('\n'):
    m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
    sentences_strings_ted.extend(sent for sent in m.groupdict()['postcolon'].split('.') if sent)

# Uncomment if you need to save some RAM: these strings are about 50MB.
# del input_text, input_text_noparens

# Let's view the first few:
sentences_strings_ted[:5]

[u"Here are two reasons companies fail: they only do more of the same, or they only do what's new",
 u'To me the real, real solution to quality growth is figuring out the balance between two activities: exploration and exploitation',
 u' Both are necessary, but it can be too much of a good thing',
 u'Consider Facit',
 u" I'm actually old enough to remember them"]

Now that we have sentences, we're ready to tokenize each of them into words. This tokenization is imperfect, of course. For instance, how many tokens is "can't", and where/how do we split it? We'll take the simplest naive approach of splitting on spaces. Before splitting, we remove non-alphanumeric characters, such as punctuation. You may want to consider the following question: why do we replace these characters with spaces rather than deleting them? Think of a case where this yields a different answer.

In [11]:
sentences_ted = []
for sent_str in sentences_strings_ted:
    tokens = re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split()
    sentences_ted.append(tokens)

Two sample processed sentences:

In [12]:
len(sentences_ted)

266694

In [13]:
print(sentences_ted[0])
print(sentences_ted[1])

[u'here', u'are', u'two', u'reasons', u'companies', u'fail', u'they', u'only', u'do', u'more', u'of', u'the', u'same', u'or', u'they', u'only', u'do', u'what', u's', u'new']
[u'to', u'me', u'the', u'real', u'real', u'solution', u'to', u'quality', u'growth', u'is', u'figuring', u'out', u'the', u'balance', u'between', u'two', u'activities', u'exploration', u'and', u'exploitation']


### list of the most common words and their occurence counts

In [42]:
from collections import Counter
from itertools import chain

counts = Counter(chain.from_iterable(sentences_ted))
print(counts.most_common(1000))

[(u'the', 207748), (u'and', 149305), (u'to', 125169), (u'of', 114818), (u'a', 105399), (u'that', 95146), (u'i', 83180), (u'in', 78070), (u'it', 74738), (u'you', 70923), (u'we', 67629), (u'is', 63251), (u's', 57156), (u'this', 49241), (u'so', 37014), (u'they', 33102), (u'was', 30806), (u'for', 29713), (u'are', 27995), (u'have', 27344), (u'but', 26732), (u'what', 26519), (u'on', 25962), (u'with', 24706), (u'can', 23377), (u't', 22757), (u'about', 21246), (u'there', 21041), (u'be', 20201), (u'as', 19488), (u'at', 19216), (u'all', 19021), (u'not', 18626), (u'do', 17928), (u'my', 17908), (u'one', 17551), (u're', 17012), (u'people', 16723), (u'like', 16273), (u'if', 15868), (u'from', 15452), (u'now', 14387), (u'our', 14061), (u'he', 13986), (u'an', 13917), (u'just', 13896), (u'these', 13882), (u'or', 13864), (u'when', 13278), (u'because', 12879), (u'very', 12363), (u'me', 12302), (u'out', 12163), (u'by', 11935), (u'them', 11595), (u'how', 11576), (u'know', 11506), (u'up', 11445), (u'going', 

In [64]:
# ...
from collections import Counter
from itertools import chain

counts = Counter(chain.from_iterable(sentences_ted))
#print(counts.most_common(1000))
for letter, count in counts.most_common(1000):
    print '%s: %7d' % (letter, count)

the:  207748
and:  149305
to:  125169
of:  114818
a:  105399
that:   95146
i:   83180
in:   78070
it:   74738
you:   70923
we:   67629
is:   63251
s:   57156
this:   49241
so:   37014
they:   33102
was:   30806
for:   29713
are:   27995
have:   27344
but:   26732
what:   26519
on:   25962
with:   24706
can:   23377
t:   22757
about:   21246
there:   21041
be:   20201
as:   19488
at:   19216
all:   19021
not:   18626
do:   17928
my:   17908
one:   17551
re:   17012
people:   16723
like:   16273
if:   15868
from:   15452
now:   14387
our:   14061
he:   13986
an:   13917
just:   13896
these:   13882
or:   13864
when:   13278
because:   12879
very:   12363
me:   12302
out:   12163
by:   11935
them:   11595
how:   11576
know:   11506
up:   11445
going:   11366
had:   10902
more:   10900
think:   10463
who:   10446
were:   10180
see:   10179
your:   10091
their:   10029
which:   10021
would:    9911
here:    9872
really:    9675
get:    9376
ve:    9312
then:    9239
m:    9160
world:    890

### Part 2: Word Frequencies

If you store the counts of the top 1000 words in a list called `counts_ted_top1000`, the code below will plot the histogram requested in the writeup.

### Storing the list of word counts in counts_ted_top1000

In [65]:
counts_ted_top1000 = []
for letter, count in counts.most_common(1000):
    counts_ted_top1000.append(count)
print(counts_ted_top1000)

[207748, 149305, 125169, 114818, 105399, 95146, 83180, 78070, 74738, 70923, 67629, 63251, 57156, 49241, 37014, 33102, 30806, 29713, 27995, 27344, 26732, 26519, 25962, 24706, 23377, 22757, 21246, 21041, 20201, 19488, 19216, 19021, 18626, 17928, 17908, 17551, 17012, 16723, 16273, 15868, 15452, 14387, 14061, 13986, 13917, 13896, 13882, 13864, 13278, 12879, 12363, 12302, 12163, 11935, 11595, 11576, 11506, 11445, 11366, 10902, 10900, 10463, 10446, 10180, 10179, 10091, 10029, 10021, 9911, 9872, 9675, 9376, 9312, 9239, 9160, 8906, 8841, 8823, 8629, 8262, 8256, 8011, 7858, 7735, 7724, 7617, 7537, 7536, 7281, 7227, 7183, 7168, 7141, 7119, 6937, 6806, 6521, 6424, 6358, 6307, 6292, 6229, 6207, 5972, 5907, 5695, 5684, 5571, 5552, 5501, 5456, 5379, 5374, 5312, 5275, 5220, 5197, 5175, 5153, 5133, 4935, 4929, 4919, 4660, 4546, 4544, 4511, 4485, 4454, 4399, 4299, 4143, 4132, 4130, 4120, 4079, 4034, 3971, 3968, 3799, 3798, 3791, 3787, 3786, 3707, 3611, 3494, 3449, 3364, 3346, 3301, 3245, 3217, 3201, 31

Plot distribution of top-1000 words

In [66]:
hist, edges = np.histogram(counts_ted_top1000, density=True, bins=100, normed=True)

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="Top-1000 words distribution")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
show(p)

### Part 3: Train Word2Vec

In [70]:
from gensim.models import Word2Vec



In [78]:
# ...
model_ted = Word2Vec(sentences_ted, size=100, window=5, min_count=5, workers=4)

In [81]:
model_ted.similarity('and', 'for')
len(model_ted.wv.vocab)

21444

### Part 4: Ted Learnt Representations

Finding similar words: (see gensim docs for more functionality of `most_similar`)

In [80]:
model_ted.most_similar("man")

[(u'woman', 0.8587267398834229),
 (u'guy', 0.8015537261962891),
 (u'lady', 0.7703568935394287),
 (u'girl', 0.7561986446380615),
 (u'boy', 0.7534090280532837),
 (u'gentleman', 0.7350908517837524),
 (u'soldier', 0.7309730052947998),
 (u'kid', 0.6917425990104675),
 (u'poet', 0.6867652535438538),
 (u'person', 0.6607731580734253)]

In [82]:
model_ted.most_similar("computer")

[(u'machine', 0.7316159605979919),
 (u'software', 0.7257976531982422),
 (u'device', 0.7119448184967041),
 (u'robot', 0.7004610300064087),
 (u'camera', 0.6708080768585205),
 (u'3d', 0.6697149276733398),
 (u'program', 0.6578587293624878),
 (u'chip', 0.645690381526947),
 (u'mechanical', 0.6343941688537598),
 (u'visualization', 0.6279263496398926)]

In [84]:
# ...
words_top_ted = []
for letter, count in counts.most_common(1000):
    words_top_ted.append(letter)
print(words_top_ted)

[u'the', u'and', u'to', u'of', u'a', u'that', u'i', u'in', u'it', u'you', u'we', u'is', u's', u'this', u'so', u'they', u'was', u'for', u'are', u'have', u'but', u'what', u'on', u'with', u'can', u't', u'about', u'there', u'be', u'as', u'at', u'all', u'not', u'do', u'my', u'one', u're', u'people', u'like', u'if', u'from', u'now', u'our', u'he', u'an', u'just', u'these', u'or', u'when', u'because', u'very', u'me', u'out', u'by', u'them', u'how', u'know', u'up', u'going', u'had', u'more', u'think', u'who', u'were', u'see', u'your', u'their', u'which', u'would', u'here', u'really', u'get', u've', u'then', u'm', u'world', u'us', u'time', u'some', u'has', u'don', u'actually', u'into', u'way', u'where', u'will', u'years', u'things', u'other', u'no', u'could', u'go', u'well', u'want', u'been', u'make', u'right', u'she', u'said', u'something', u'those', u'first', u'two', u'than', u'much', u'also', u'look', u'new', u'thing', u'little', u'got', u'back', u'over', u'most', u'say', u'even', u'his', u'

#### t-SNE visualization
To use the t-SNE code below, first put a list of the top 1000 words (as strings) into a variable `words_top_ted`. The following code gets the corresponding vectors from the model, assuming it's called `model_ted`:

In [85]:
# This assumes words_top_ted is a list of strings, the top 1000 words
words_top_vec_ted = model_ted[words_top_ted]

In [86]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
words_top_ted_tsne = tsne.fit_transform(words_top_vec_ted)

In [87]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE for most common words")

source = ColumnDataSource(data=dict(x1=words_top_ted_tsne[:,0],
                                    x2=words_top_ted_tsne[:,1],
                                    names=words_top_ted))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

### Part 5: Wiki Learnt Representations

Download dataset

In [89]:
if not os.path.isfile('wikitext-103-raw-v1.zip'):
    urllib.urlretrieve("https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip", filename="wikitext-103-raw-v1.zip")

In [91]:
with zipfile.ZipFile('wikitext-103-raw-v1.zip', 'r') as z:
    #input_text = str(z.open('wikitext-103-raw/wiki.train.raw', 'r').read(), encoding='utf-8') # Thanks Robert Bastian
    input_text = str(z.open('wikitext-103-raw/wiki.train.raw', 'r').read()) # Thanks Robert Bastian

Preprocess sentences (note that it's important to remove small sentences for performance)

In [92]:
sentences_wiki = []
for line in input_text.split('\n'):
    s = [x for x in line.split('.') if x and len(x.split()) >= 5]
    sentences_wiki.extend(s)
    
for s_i in range(len(sentences_wiki)):
    sentences_wiki[s_i] = re.sub("[^a-z]", " ", sentences_wiki[s_i].lower())
    sentences_wiki[s_i] = re.sub(r'\([^)]*\)', '', sentences_wiki[s_i])
del input_text

In [93]:
# sample 1/5 of the data
shuffle(sentences_wiki)
print(len(sentences_wiki))
sentences_wiki = sentences_wiki[:int(len(sentences_wiki)/5)]
print(len(sentences_wiki))

4267112
853422


In [102]:
print(sentences_wiki[0])
print(sentences_wiki[1])

 elements of this mobile strike force company operated from a fortified bunker about     meters west of the camp   which served as an observation post 
 the who have received many awards and accolades from the music industry for their recordings and their influence 


Now, repeat all the same steps that you performed above. You should be able to reuse essentially all the code.

### list of the most common words and their occurence counts

In [97]:
# ...
from collections import Counter
from itertools import chain

counts = Counter(chain.from_iterable(sentences_wiki))
print(counts.most_common(1000))

[(' ', 25010069), ('e', 9884124), ('t', 7134481), ('a', 7018304), ('i', 5972806), ('n', 5948093), ('o', 5830652), ('r', 5385881), ('s', 5335018), ('h', 3964298), ('l', 3351718), ('d', 3339278), ('c', 2678770), ('u', 2108656), ('m', 2098620), ('f', 1814527), ('g', 1656539), ('p', 1654461), ('w', 1428007), ('b', 1291292), ('y', 1230060), ('v', 848731), ('k', 567658), ('j', 175057), ('x', 157038), ('z', 95494), ('q', 77374)]


In [98]:
from collections import Counter
from itertools import chain

counts = Counter(chain.from_iterable(sentences_wiki))
#print(counts.most_common(1000))
for letter, count in counts.most_common(1000):
    print '%s: %7d' % (letter, count)

 : 25010069
e: 9884124
t: 7134481
a: 7018304
i: 5972806
n: 5948093
o: 5830652
r: 5385881
s: 5335018
h: 3964298
l: 3351718
d: 3339278
c: 2678770
u: 2108656
m: 2098620
f: 1814527
g: 1656539
p: 1654461
w: 1428007
b: 1291292
y: 1230060
v:  848731
k:  567658
j:  175057
x:  157038
z:   95494
q:   77374


### Storing the list of word counts in counts_ted_top1000

#### t-SNE visualization

In [None]:
# This assumes words_top_wiki is a list of strings, the top 1000 words
words_top_vec_wiki = model_wiki[words_top_wiki]

tsne = TSNE(n_components=2, random_state=0)
words_top_wiki_tsne = tsne.fit_transform(words_top_vec_wiki)

In [None]:
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE for most common words")

source = ColumnDataSource(data=dict(x1=words_top_wiki_tsne[:,0],
                                    x2=words_top_wiki_tsne[:,1],
                                    names=words_top_wiki))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)