# Word Representations

## *"I know words. I have the best words!"*
    - Noam Chomsky

## Discrete Sparse Representations

In [1]:
! pip install wget

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp37-none-any.whl size=9681 sha256=3c28a39813480d99e24cd7de97987951bcde0ff92aacef747128c26b10f25aeb
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [2]:
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/reviews.full.tsv.zip'
wget.download(url, 'reviews.full.tsv.zip')

'reviews.full.tsv.zip'

In [3]:
from zipfile import ZipFile
with ZipFile('reviews.full.tsv.zip', 'r') as zf:
    zf.extractall()

In [4]:
import pandas as pd
df = pd.read_csv('reviews.full.tsv', sep='\t', nrows=100000)
documents = df.text.tolist()
print(documents[:4])

["Prices change daily and if you want to really research the price continually at many different sites , I have found cheaper cars elsewhere . However , if you don ' t have a lot of time to research the price , this site has always been among the top three ( e . g ., cheapest ) of the ten sites I use to reserve a car .", 'and the fact that they will match other companies is awesome !!', "Used Paypal for my buying and selling for the past 0 years and never had an issue they didn ' t resolve to my satisfaction .", "I ' ve made two purchases on CJ ' s for Fallout : New Vegas and The Elder Scrolls V : Skyrim . I have been satisfied by both , being extremely cheaper than the Steam versions . The Autokey system that CJ ' s uses is genius . I recommend this site to anyone who is a PC gamer !"]


In [16]:
df.head(10)

Unnamed: 0,score,category,uid,gender,age,text
0,5,Car Rental,899881,F,50,Prices change daily and if you want to really ...
1,5,Fitness & Nutrition,828184,M,32,and the fact that they will match other compan...
2,5,Electronic Payment,1698375,M,48,Used Paypal for my buying and selling for the ...
3,5,Gaming,3324079,M,29,I ' ve made two purchases on CJ ' s for Fallou...
4,4,Jewelry,719816,F,29,I was very happy with the diamond that I order...
5,5,Security Equipment,5630105,F,66,I signed up with front point security 0 months...
6,5,Electronics,6929926,M,69,First off I usually never get extended warrant...
7,5,Gaming,2364273,M,20,"The games come , no worries , they are reputab..."
8,1,Media & Marketing,2561769,F,32,We worked hard to send out email invitations f...
9,4,Shoes,2561769,F,32,I am in love with all the free movies and show...


In [17]:
df.score # a single column is returned as a pandas Series

0        5
1        5
2        5
3        5
4        4
        ..
99995    1
99996    5
99997    3
99998    1
99999    5
Name: score, Length: 100000, dtype: int64

In [18]:
df.score.value_counts() # number of times each value appears

5    78827
4     9164
1     7316
3     2496
2     2197
Name: score, dtype: int64

In [19]:
df.gender.value_counts() # or df['gender'].value_counts()

M    59708
F    40292
Name: gender, dtype: int64

In [15]:
df.describe()

Unnamed: 0,score,uid,age
count,100000.0,100000.0,100000.0
mean,4.49989,2697134.0,41.31774
std,1.144409,2068414.0,13.841225
min,1.0,10386.0,16.0
25%,5.0,1037387.0,30.0
50%,5.0,2088027.0,41.0
75%,5.0,3877405.0,52.0
max,5.0,8363749.0,70.0


In [5]:
from sklearn.feature_extraction.text import CountVectorizer
small_vectorizer = CountVectorizer()

sentences_2 = documents[:1]

X1 = small_vectorizer.fit_transform(sentences_2)

In [6]:
small_vectorizer
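# binary = if True, record only presence/absence (0/1) instead of counts
# lowercase = convert all text to lower case before building the vocabulary
# max_features = cap the number of columns (keeps the most frequent terms)
# max_df/min_df = document-frequency thresholds; use them to keep very
#   frequent words (e.g., stopwords) or very rare ones out of the vocabulary
# ngram_range = range of n-gram sizes; the default (1, 1) means unigrams only
# preprocessor = a function applied to each document before tokenization
# tokenizer = a custom tokenizer function, e.g., one designed for Twitter
# vocabulary = a predefined vocabulary (terms mapped to column indices)
#   to use instead of learning one from the data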

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

Let's implement this ourselves:

In [None]:
import numpy as np
num_docs = 10

# collect all word types (= vocabulary)
vocabulary = set()
for document in documents[:num_docs]:
    tokens = document.lower().split() # tokenize by whitespace (the simplest tokenizer)
    vocabulary = vocabulary.union(set(tokens))
vocabulary = sorted(vocabulary) # sort alphabetically
# note: whitespace splitting leaves punctuation clusters such as '.,' in the
# vocabulary, which we would normally want to avoid

# create the DATA MATRIX with #docs-by-#features dimensions
X = np.zeros((num_docs, len(vocabulary))) #filled with zeros at the beginning

# fill that matrix with sweet counts
# enumerate() yields each document together with its index
# we fill the data matrix cell by cell: one row per document, one column per word
for d, document in enumerate(documents[:num_docs]):
    tokens = document.lower().split()
    for i, feature in enumerate(vocabulary):
        X[d, i] = tokens.count(feature)

# show the result as a DataFrame
pd.DataFrame(data=X, columns=vocabulary, dtype=int)

# with 10 documents, the vocabulary has grown considerably
# the result is a sparse matrix (mostly zeros)

In [None]:
vocabulary_ = {word: position for position, word in enumerate(vocabulary)}
vocabulary_
# maps each vocabulary word to its column position

The result is a *sparse count matrix*:

In [9]:
# indexed representation
import numpy as np
# print(X1)

# dense representation
print(X1.todense())

[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 2 2 1 1 2 1 1 2 1 4 1 1 1 3
  1 1 1 2]]
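To see why a sparse format pays off once most cells are zero, here is a minimal sketch that keeps only the non-zero counts of a small dense matrix as `(row, column) -> count` entries. The toy matrix and the helper `to_sparse_dict` are made up for illustration; SciPy's Compressed Sparse Row format is organised differently internally, but the idea of storing only the non-zero cells is the same.

```python
import numpy as np

def to_sparse_dict(dense):
    """Keep only non-zero cells as {(row, col): value}. Hypothetical helper."""
    rows, cols = np.nonzero(dense)
    return {(int(r), int(c)): int(dense[r, c]) for r, c in zip(rows, cols)}

# toy count matrix: 3 documents, 6 features
dense = np.array([[1, 0, 0, 2, 0, 0],
                  [0, 0, 0, 0, 0, 1],
                  [0, 3, 0, 0, 0, 0]])

sparse = to_sparse_dict(dense)
print(sparse)
print("stored cells:", len(sparse), "of", dense.size)
```

Only 4 of the 18 cells need to be stored here; with thousands of vocabulary columns the saving becomes much larger.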


We can access the mapping from vector position to feature names via `get_feature_names()` (in scikit-learn 1.0 and later, this method is called `get_feature_names_out()`):

In [10]:
print(small_vectorizer.get_feature_names())

['always', 'among', 'and', 'at', 'been', 'car', 'cars', 'change', 'cheaper', 'cheapest', 'continually', 'daily', 'different', 'don', 'elsewhere', 'found', 'has', 'have', 'however', 'if', 'lot', 'many', 'of', 'price', 'prices', 'really', 'research', 'reserve', 'site', 'sites', 'ten', 'the', 'this', 'three', 'time', 'to', 'top', 'use', 'want', 'you']


The inverse (the mapping from feature names to vector positions) is encoded as a dictionary in `vocabulary_`:

In [11]:
print(small_vectorizer.vocabulary_)
# the main difference: vocabulary_ maps each column name to its position,
# while get_feature_names() returns the column names in column order

{'prices': 24, 'change': 7, 'daily': 11, 'and': 2, 'if': 19, 'you': 39, 'want': 38, 'to': 35, 'really': 25, 'research': 26, 'the': 31, 'price': 23, 'continually': 10, 'at': 3, 'many': 21, 'different': 12, 'sites': 29, 'have': 17, 'found': 15, 'cheaper': 8, 'cars': 6, 'elsewhere': 14, 'however': 18, 'don': 13, 'lot': 20, 'of': 22, 'time': 34, 'this': 32, 'site': 28, 'has': 16, 'always': 0, 'been': 4, 'among': 1, 'top': 36, 'three': 33, 'cheapest': 9, 'ten': 30, 'use': 37, 'reserve': 27, 'car': 5}
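The two views carry the same information, so one can be rebuilt from the other. A small sketch with a made-up three-word mapping (not the output above):

```python
# toy mapping in the style of `vocabulary_` (feature name -> column index)
vocabulary_ = {'prices': 2, 'change': 0, 'daily': 1}

# sort the names by their column index to recover the ordered feature list
feature_names = sorted(vocabulary_, key=vocabulary_.get)
print(feature_names)

# and back again: list position -> dictionary entry
rebuilt = {name: idx for idx, name in enumerate(feature_names)}
print(rebuilt == vocabulary_)
```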


## Terminology 

![](matrix.pdf)

Let's redo this for the entire corpus:

In [42]:
vectorizer = CountVectorizer(analyzer='word', 
                             ngram_range=(1, 2), 
                             min_df=0.001, # ignore terms found in fewer than 0.1% of documents
                             max_df=0.75,  # ignore terms found in more than 75% of documents
                             stop_words='english')

# check the stop-word list: for sentiment analysis, it may remove words that
# carry signal (e.g., negations) -> consider a custom stop-word list,
# for example one adapted from the nltk package

X = vectorizer.fit_transform(documents[:10000])

# fit_transform = learn the vocabulary (fit) and count the words (transform)
# from here on, call only transform() on new data, not fit() or fit_transform()

print("shape: ", X.shape)

shape:  (10000, 3869)
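To make the `min_df` and `max_df` ratios concrete, here is a rough pure-Python sketch of document-frequency filtering on a toy corpus. It illustrates the idea only and is not scikit-learn's code; `filter_by_df` and the toy documents are invented for this example.

```python
from collections import Counter

def filter_by_df(docs, min_df, max_df):
    """Return words whose document-frequency ratio lies in [min_df, max_df]."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # count each word once per document
    n = len(docs)
    return sorted(w for w, c in df.items() if min_df <= c / n <= max_df)

docs = ["the car was cheap",
        "the delivery was fast",
        "the price was fair",
        "the car arrived late"]

# 'the' is in 4/4 documents (too frequent); words in only 1/4 are too rare
kept = filter_by_df(docs, min_df=0.5, max_df=0.75)
print(kept)
```

Only `car` (2 of 4 documents) and `was` (3 of 4) fall inside the band and survive.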



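```
x1 = vectorizer.fit_transform(documents[:10000])
x2 = vectorizer.fit_transform(documents[10000:20000])

x1 = vectorizer.fit_transform(documents[:10000])
x2 = vectorizer.transform(documents[10000:20000])
```

What is the difference?

In the first case, the second `fit_transform()` call learns a new vocabulary from the second batch, so the columns of `x1` and `x2` stand for different features and the two matrices are not comparable (this is wrong).
In the second case, `x2` is built with the vocabulary learned from the first batch, so both matrices share the same columns (this is okay).

In machine-learning terms: fit on the training data only, then apply `transform()` to any held-out or test data.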

Calling `transform()` on a new document will apply the vocabulary we collected previously to this new data point. Any words we have not seen before are ignored.
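The split between `fit` and `transform` can be sketched in a few lines. This is a simplified stand-in (whitespace tokenization, no stop words), not scikit-learn's implementation, and the toy sentences are made up.

```python
import numpy as np

def fit(docs):
    """Learn the vocabulary: word -> column index (sorted alphabetically)."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}

def transform(docs, vocab):
    """Count words using a fixed vocabulary; unknown words are skipped."""
    X = np.zeros((len(docs), len(vocab)), dtype=int)
    for d, doc in enumerate(docs):
        for w in doc.lower().split():
            if w in vocab:
                X[d, vocab[w]] += 1
    return X

train = ["great prices", "fast delivery great service"]
vocab = fit(train)
X_new = transform(["great great shipping"], vocab)

print(vocab)
print(X_new)  # 'shipping' was never seen during fit, so it is not counted
```

The new document still gets one column per training word, which is what keeps matrices built at different times comparable.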


In [44]:
vectorizer.transform([documents[-1]])

<1x3869 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [45]:
documents[-1]

'Never had any issues , easy to use and great prices .'

## Exercise

Use vector operations to find out 
- what the 5 most frequent words are in `X`
- in how many different documents the word `delivery` occurs
- what percentage of the overall corpus that number corresponds to

In [None]:
# your code here
print(vectorizer.get_feature_names()) # feature names (i.e., the column labels)
print(vectorizer.get_feature_names()[3008]) # the feature name at column index 3008
print(X) # sparse format: (row, column) followed by the count of that feature

import numpy as np
from collections import Counter

# counts = np.asarray(X.sum(axis=0))[0]
# count_ids = counts.argsort()[::-1] # column indices, from most to least frequent
# feature_names = vectorizer.get_feature_names()
# for idx in count_ids[:5]:
#   print(feature_names[idx])
# np.count_nonzero(X[:, vectorizer.vocabulary_['delivery']].toarray()) / X.shape[0]
# the fraction of documents that contain the word 'delivery'

# total count per feature, flattened to a 1-D array
# (avoid naming the variable `sum`, which would shadow the built-in)
totals = np.asarray(X.sum(axis=0)).ravel()
# Counter takes a single mapping, so pair each name with its total first
Counter(dict(zip(vectorizer.get_feature_names(), totals))).most_common(5)

## Character $n$-grams

We can also use characters to analyze text:

In [55]:
char_vectorizer = CountVectorizer(analyzer='char', 
                                  ngram_range=(2, 6), 
                                  min_df=0.001, 
                                  max_df=0.75)

C = char_vectorizer.fit_transform(documents[:10])
C

<10x8054 sparse matrix of type '<class 'numpy.int64'>'
	with 10806 stored elements in Compressed Sparse Row format>

In [56]:
print(char_vectorizer.vocabulary_)

{'pr': 5953, 'ic': 4050, 'ce': 2121, 'ch': 2155, 'ng': 5194, 'ge': 3612, ' d': 382, 'da': 2407, 'ai': 1609, 'il': 4153, 'ly': 4732, 'if': 4118, 'f ': 3378, ' y': 1264, 'yo': 8014, 'ou': 5723, 'u ': 7367, 'wa': 7716, 'nt': 5298, 'ea': 2786, 'ar': 1819, 'rc': 6141, 'ti': 7183, 'nu': 5338, 'ua': 7387, 'at': 1904, 'ny': 5348, 'di': 2471, 'ff': 3452, 'fe': 3434, 'en': 3045, 'si': 6696, 'it': 4361, 'te': 7022, ' ,': 51, ', ': 1337, 'i ': 3975, 'av': 1959, 'fo': 3495, 'un': 7438, 'ap': 1805, 'pe': 5880, 'ca': 2106, 'rs': 6380, ' e': 430, 'el': 2985, 'ls': 4709, 'ew': 3324, 'wh': 7773, '. ': 1395, 'ho': 3918, 'ow': 5794, 'we': 7736, 'ev': 3306, 'do': 2505, " '": 17, "' ": 1295, 'a ': 1501, ' l': 690, 'lo': 4679, 'ot': 5699, ' o': 785, 'of': 5474, 'im': 4177, 'hi': 3896, 'as': 1867, 'lw': 4727, 'ay': 1987, 'ys': 8040, ' b': 296, 'be': 2021, 'ee': 2919, 'am': 1680, 'mo': 4891, 'g ': 3554, 'op': 5653, 'p ': 5817, 'hr': 3943, ' (': 34, '( ': 1318, ' g': 524, '.,': 1449, ' )': 42, ') ': 1327, ' u':
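What counts as a character $n$-gram is easy to check by hand. A minimal sketch, assuming a plain sliding window over the string; `char_ngrams` is a hypothetical helper, and `CountVectorizer` additionally lowercases and normalises whitespace before extracting them.

```python
from collections import Counter

def char_ngrams(text, n_min, n_max):
    """Count all character n-grams with n_min <= n <= n_max (spaces included)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

grams = char_ngrams("prices", 2, 3)
print(sorted(grams))
```

The six-letter word yields five bigrams and four trigrams, which explains why the character vocabulary above grows so quickly even for 10 documents.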

## Syntactic $n$-grams

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm') # the 'en' shortcut was removed in spaCy 3

features = [' '.join(["{}_{}".format(c.lemma_, c.head.lemma_) 
                      for c in nlp(sentence)])
            for sentence in documents[:100]]

syntax_vectorizer = CountVectorizer()
X = syntax_vectorizer.fit_transform(features)

In [None]:
print(documents[0])
print(features[0])

In [None]:
print(syntax_vectorizer.vocabulary_)
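To see what such a feature string looks like without running a parser, here is a sketch with hand-written (lemma, head lemma) pairs for "Prices change daily". The pairs are assumptions made for illustration; in the cells above they come from spaCy.

```python
# hand-written (lemma, head lemma) pairs; a parser would normally supply these
parsed = [("price", "change"),   # the subject attaches to the verb
          ("change", "change"),  # the root is its own head
          ("daily", "change")]   # the adverb attaches to the verb

feature = ' '.join("{}_{}".format(lemma, head) for lemma, head in parsed)
print(feature)
```

Because `_` counts as a word character in the default `token_pattern`, each `lemma_head` pair survives as a single feature when the string is passed to `CountVectorizer`.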

# Dense Distributed Representations

## Word embeddings with `Word2vec`

In [57]:
from gensim.models import Word2Vec
from gensim.models.word2vec import FAST_VERSION

corpus = [document.split() for document in documents]

# initialize model
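w2v_model = Word2Vec(size=100,   # called `vector_size` in gensim >= 4.0
                     window=15,
                     sample=0.0001,
                     iter=200,   # called `epochs` in gensim >= 4.0
                     negative=5, 
                     min_count=100,
                     workers=-1, # caution: gensim expects a positive thread count; with -1 no
                                 # worker threads start, hence the (0, 0) returned by train() below
                     hs=0
)

# size = dimensionality of the word vectors; 300 is a common choice, but takes longer to train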

w2v_model.build_vocab(corpus)

w2v_model.train(corpus, 
                total_examples=w2v_model.corpus_count, 
                epochs=w2v_model.epochs)


(0, 0)

In [58]:
print(corpus[0])

['Prices', 'change', 'daily', 'and', 'if', 'you', 'want', 'to', 'really', 'research', 'the', 'price', 'continually', 'at', 'many', 'different', 'sites', ',', 'I', 'have', 'found', 'cheaper', 'cars', 'elsewhere', '.', 'However', ',', 'if', 'you', 'don', "'", 't', 'have', 'a', 'lot', 'of', 'time', 'to', 'research', 'the', 'price', ',', 'this', 'site', 'has', 'always', 'been', 'among', 'the', 'top', 'three', '(', 'e', '.', 'g', '.,', 'cheapest', ')', 'of', 'the', 'ten', 'sites', 'I', 'use', 'to', 'reserve', 'a', 'car', '.']


Now, we can use the embeddings of the model

In [59]:
w2v_model.wv['delivery']

array([-2.7229232e-03,  4.7890134e-03, -8.8076765e-04, -2.5874416e-03,
        4.6208804e-03, -6.7487318e-04, -2.0464528e-03, -4.0752915e-04,
       -3.8732942e-03,  3.5820017e-03,  3.3607316e-04,  1.3584318e-03,
        1.1764330e-05,  2.0887891e-03,  1.1148849e-03,  2.7005817e-04,
        3.1010199e-03, -2.3122663e-03, -1.9958511e-03, -4.2375810e-03,
        4.7674864e-03, -4.4565406e-03,  2.0053375e-03, -3.5388852e-03,
        4.0580211e-03,  2.7737829e-03,  3.6442596e-03,  3.6585934e-03,
        3.7986238e-03,  1.1509076e-03,  4.0841298e-03, -2.7280578e-03,
       -4.8374305e-03, -2.3925002e-04, -4.1261194e-03,  3.5741476e-03,
       -3.6344498e-03, -2.6655041e-03,  2.2219250e-03,  4.2993170e-03,
       -4.6622329e-03, -6.8110833e-04, -3.9750533e-03, -3.9369785e-03,
       -1.5668268e-03,  4.5235441e-03, -4.9112514e-03,  1.4892700e-03,
       -4.8684804e-03, -9.4596267e-04, -4.4946340e-03,  1.1054854e-03,
        3.1266590e-03, -2.5223212e-03,  5.3584145e-04,  2.3242813e-03,
      

In [61]:
w2v_model.wv.most_similar(['delivery'])
# the words most similar to 'delivery', with their cosine similarity
# higher is better (the maximum is 1)
# note that the values here are generally not very high

[('pounds', 0.3349335491657257),
 ('red', 0.32473188638687134),
 ('door', 0.3136950731277466),
 ('mothers', 0.29352810978889465),
 ('acceptable', 0.29300832748413086),
 ('scheduled', 0.2918166220188141),
 ('reccomend', 0.281585693359375),
 ('living', 0.2782709300518036),
 ('stars', 0.2742609977722168),
 ('scan', 0.262966513633728)]
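The similarity scores are cosine similarities between word vectors. A minimal sketch of that measure on made-up 3-d vectors (real embeddings have 100 dimensions here):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # 0: nothing in common
```

Only the direction matters, not the length, which is why `a` and `b` count as (almost exactly) identical.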

In [62]:
w2v_model.wv.most_similar(['delivery','concert'])
# the words most similar to 'delivery' and 'concert' together
# think of the words as points in space: we look for the points
# "in the middle" between the query words,
# i.e., a nearest-neighbour search in the vector space

[('displayed', 0.3079010844230652),
 ('........', 0.30706268548965454),
 ('anymore', 0.30380067229270935),
 ('directly', 0.2870674133300781),
 ('Still', 0.28459298610687256),
 ('gesture', 0.28425103425979614),
 ('market', 0.27986207604408264),
 ('Animed', 0.27397236227989197),
 ('door', 0.2671118378639221),
 ('occasions', 0.2609759569168091)]

In [63]:
# birthday - present + husband, i.e., present : birthday as husband : ?
w2v_model.wv.most_similar(positive=['birthday', 'husband'], negative=['present'], topn=3)

[('dinner', 0.33260515332221985),
 ('SO', 0.32500073313713074),
 ('disappointing', 0.3154750168323517)]
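The arithmetic behind `positive` and `negative` can be sketched as follows. This assumes, as I understand gensim's behaviour, that the query is the mean of the length-normalised positive vectors and the negated negative vectors, and that the query words themselves are left out of the ranking. The 3-d vectors are invented, and this `most_similar` is a stand-in, not gensim's function.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to the mean of +positive and -negative."""
    parts = [unit(vectors[w]) for w in positive] + \
            [-unit(vectors[w]) for w in negative]
    query = unit(np.mean(parts, axis=0))
    scores = {w: float(np.dot(query, unit(v)))
              for w, v in vectors.items()
              if w not in positive and w not in negative}  # skip the query words
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# made-up 3-d vectors; the axes loosely mean (person, female, royal)
vectors = {"man":   np.array([1.0, 0.0, 0.0]),
           "woman": np.array([1.0, 1.0, 0.0]),
           "king":  np.array([1.0, 0.0, 1.0]),
           "queen": np.array([1.0, 1.0, 1.0]),
           "apple": np.array([-1.0, 0.0, 0.0])}

# king - man + woman
result = most_similar(vectors, positive=["king", "woman"], negative=["man"])
print(result)
```

With these toy vectors, `queen` ranks first, well ahead of `apple`. Passing several words in `positive` and nothing in `negative` gives the "point in the middle" query from the previous cell.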

In [64]:
word1 = "Cheapest"
word2 = "friendly"

# retrieve the actual vector (we will never use it)
# print(w2v_model.wv[word1])

# compare by computing similarity between two words
print(w2v_model.wv.similarity(word1, word2))

# get the 3 most similar words
print(w2v_model.wv.most_similar(word1, topn=3))


-0.010900016
[('months', 0.37436985969543457), ('town', 0.33676764369010925), ('United', 0.3110150992870331)]



### Exercise
Use `spacy` to restrict the words in the reviews to *content words*, i.e., nouns, verbs, and adjectives. Transform the words to lower case and add the POS with an underscore. E.g.:

`love_VERB old-fashioneds_NOUN`

This also allows us to distinguish between homographs, i.e., words that are written the same, but belong to different word classes, e.g., *love* in "I **love** old-fashioneds" vs. "He felt so sick, it must have been **love**".


Make sure to exclude sentences that contain none of the above.

Write the resulting corpus to a variable called `word_corpus`.

In [None]:
# Your code here

Rerun the `Word2vec` model from above on the new data set and test the words out.

In [None]:
# Your code here

## Exercise

Train 4 more `Word2vec` models and average the resulting embedding matrices.

In [None]:
# Your code here



## Document embeddings with `Doc2Vec`

In [65]:
df.head()

Unnamed: 0,score,category,uid,gender,age,text
0,5,Car Rental,899881,F,50,Prices change daily and if you want to really ...
1,5,Fitness & Nutrition,828184,M,32,and the fact that they will match other compan...
2,5,Electronic Payment,1698375,M,48,Used Paypal for my buying and selling for the ...
3,5,Gaming,3324079,M,29,I ' ve made two purchases on CJ ' s for Fallou...
4,4,Jewelry,719816,F,29,I was very happy with the diamond that I order...


In [66]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import FAST_VERSION
from gensim.models.doc2vec import TaggedDocument

corpus = []

for row in df.iterrows():
    label = row[1].score # as a tag we take the score
    text = row[1].text
    corpus.append(TaggedDocument(words=text.split(), tags=[str(label)]))

print('done')
d2v_model = Doc2Vec(vector_size=100, 
                    window=15,
                    hs=0,
                    sample=0.000001,
                    negative=5,
                    min_count=100,
                    workers=4, # number of worker threads; must be positive (-1 starts no threads, so nothing is trained)
                    epochs=500,
                    dm=0, 
                    dbow_words=1)

d2v_model.build_vocab(corpus)

d2v_model.train(corpus, total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)

done


We can now look at the elements

In [None]:
d2v_model.docvecs[0]

In [None]:
d2v_model.docvecs.doctags

In [None]:
target_doc = '1'

similar_docs = d2v_model.docvecs.most_similar(target_doc, topn=5)
print(similar_docs)
# we compute the similarity with the other tags

## Exercise

What are the 10 most similar ***words*** to each category?

In [None]:
# your code here