# Word Representations

## *"I know words. I have the best words!"*
    - Noam Chomsky

## Discrete Sparse Representations

In [1]:
! pip install wget

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp37-none-any.whl size=9681 sha256=3a100c29566ede524b69a1a710d67ddd4cf85837583b8d6b004fa735791537b0
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [2]:
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/reviews.full.tsv.zip'
wget.download(url, 'reviews.full.tsv.zip')

'reviews.full.tsv.zip'

In [3]:
from zipfile import ZipFile
with ZipFile('reviews.full.tsv.zip', 'r') as zf:
    zf.extractall()

In [4]:
import pandas as pd
df = pd.read_csv('reviews.full.tsv', sep='\t', nrows=100000)
documents = df.text.tolist()
print(documents[:4])

["Prices change daily and if you want to really research the price continually at many different sites , I have found cheaper cars elsewhere . However , if you don ' t have a lot of time to research the price , this site has always been among the top three ( e . g ., cheapest ) of the ten sites I use to reserve a car .", 'and the fact that they will match other companies is awesome !!', "Used Paypal for my buying and selling for the past 0 years and never had an issue they didn ' t resolve to my satisfaction .", "I ' ve made two purchases on CJ ' s for Fallout : New Vegas and The Elder Scrolls V : Skyrim . I have been satisfied by both , being extremely cheaper than the Steam versions . The Autokey system that CJ ' s uses is genius . I recommend this site to anyone who is a PC gamer !"]


In [5]:
df.head(10)

Unnamed: 0,score,category,uid,gender,age,text
0,5,Car Rental,899881,F,50,Prices change daily and if you want to really ...
1,5,Fitness & Nutrition,828184,M,32,and the fact that they will match other compan...
2,5,Electronic Payment,1698375,M,48,Used Paypal for my buying and selling for the ...
3,5,Gaming,3324079,M,29,I ' ve made two purchases on CJ ' s for Fallou...
4,4,Jewelry,719816,F,29,I was very happy with the diamond that I order...
5,5,Security Equipment,5630105,F,66,I signed up with front point security 0 months...
6,5,Electronics,6929926,M,69,First off I usually never get extended warrant...
7,5,Gaming,2364273,M,20,"The games come , no worries , they are reputab..."
8,1,Media & Marketing,2561769,F,32,We worked hard to send out email invitations f...
9,4,Shoes,2561769,F,32,I am in love with all the free movies and show...


In [6]:
df.score # a single column is a pandas Series

0        5
1        5
2        5
3        5
4        4
        ..
99995    1
99996    5
99997    3
99998    1
99999    5
Name: score, Length: 100000, dtype: int64

In [7]:
df.score.value_counts() # number of times each value appears

5    78827
4     9164
1     7316
3     2496
2     2197
Name: score, dtype: int64

In [8]:
df.gender.value_counts() # or df['gender'].value_counts()

M    59708
F    40292
Name: gender, dtype: int64

In [9]:
df.describe()

Unnamed: 0,score,uid,age
count,100000.0,100000.0,100000.0
mean,4.49989,2697134.0,41.31774
std,1.144409,2068414.0,13.841225
min,1.0,10386.0,16.0
25%,5.0,1037387.0,30.0
50%,5.0,2088027.0,41.0
75%,5.0,3877405.0,52.0
max,5.0,8363749.0,70.0


In [10]:
from sklearn.feature_extraction.text import CountVectorizer # bag of words representation
small_vectorizer = CountVectorizer() # create a vectorizer instance (an object with fit/transform methods, not a function)

sentences_2 = documents[:1]

X1 = small_vectorizer.fit_transform(sentences_2)

In [11]:
small_vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

CountVectorizer parameters: 
* binary = if `True`, record only whether a word occurs (0/1) instead of counting it
* lowercase = convert all text to lower case before building the vocabulary
* max_features = cap the number of columns, keeping only the most frequent terms
* max_df / min_df = maximum and minimum document frequency; only words whose
 document frequency lies in the range [min_df, max_df] are kept, all others are ignored;
* we can set them so that stop words (which occur in almost every document) do not end up in the vocabulary
* ngram_range = which n-gram sizes to extract; the default (1, 1) means unigrams only
* preprocessor = a function that is applied to each document before tokenization
* tokenizer = a custom tokenizer function, e.g. one of the tokenizers designed for Twitter
* vocabulary = fitting normally trains the vectorizer, i.e. it builds a
 vocabulary from the given documents; if we already have a vocabulary, 
 we can pass it here and the vectorizer uses it instead of learning a new one
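
To make the `min_df` / `max_df` filter concrete, here is a minimal plain-Python sketch of document-frequency filtering. It is not scikit-learn's implementation: the function name `filter_by_document_frequency` and the toy documents are made up for illustration, and the sketch assumes `min_df` is an absolute count and `max_df` a fraction of documents.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, min_df=1, max_df=1.0):
    """Keep word types whose document frequency lies in [min_df, max_df].

    Assumption of this sketch: min_df is an absolute number of documents,
    max_df is a fraction of all documents.
    """
    n_docs = len(tokenized_docs)
    # set(doc) makes sure each word is counted at most once per document
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return sorted(word for word, df in doc_freq.items()
                  if df >= min_df and df / n_docs <= max_df)

# hypothetical toy corpus, already tokenized
toy_docs = [
    "the car was cheap".split(),
    "the service was great".split(),
    "the delivery was late".split(),
    "great car great price".split(),
]

# keep words that occur in at least 2 documents but in at most half of them
vocab = filter_by_document_frequency(toy_docs, min_df=2, max_df=0.5)
print(vocab)
```

Very frequent words such as `the` and `was` fall above the upper bound, and words seen in a single document fall below the lower bound, so only the mid-frequency words survive.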

Let's implement this ourselves:

In [12]:
import numpy as np
num_docs = 10

# collect all word types (= vocabulary)
vocabulary = set()
for document in documents[:num_docs]:
    tokens = document.lower().split() # collect tokens (using the simplest tokenizer: split on whitespace)
    vocabulary = vocabulary.union(set(tokens))
vocabulary = sorted(vocabulary) # sort alphabetically
# ideally we would not want vocabulary items made of several symbols, such as '.,'

# create the DATA MATRIX with #docs-by-#features dimensions
X = np.zeros((num_docs, len(vocabulary))) # filled with zeros at the beginning

# fill that matrix with sweet counts
# enumerate: provides each document together with its index position
# we fill the data matrix cell by cell: one row per document, one column per word
for d, document in enumerate(documents[:num_docs]):
    tokens = document.lower().split()
    for i, feature in enumerate(vocabulary):
        X[d, i] = tokens.count(feature)

# show the result as a DataFrame
pd.DataFrame(data=X, columns=vocabulary, dtype=int)

# with 10 documents, the vocabulary has grown
# we get a sparse matrix (mostly zeros)

# each row label refers to a document...

Unnamed: 0,!,!!,$,',(,),"),",",",.,".,",..,...,0,00,:,a,able,about,act,addition,after,again,ago,all,allowing,also,always,am,amazon,among,an,and,another,anyone,are,area,as,at,autokey,away,....1,trade,transfer,trustpilot,tues,tv,two,up,upsetting,us,use,used,uses,usually,v,ve,vegas,verify,versions,very,want,warranties,warranty,was,we,website,week,were,when,which,who,why,will,wire,wired,with,worked,worries,would,years,you
0,0,0,0,1,1,1,0,3,3,1,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
3,1,0,0,3,0,0,0,1,3,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0,0,1,0,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,3,0,1,1,1,0,1,6,6,0,0,1,1,1,0,1,1,0,0,0,0,1,0,0,1,0,0,0,0,0,1,5,0,0,0,0,2,0,0,1,...,0,4,0,1,0,0,2,0,0,0,0,0,0,0,0,0,1,0,2,0,0,0,2,0,1,0,2,0,0,0,1,2,4,1,5,1,0,1,0,1
5,1,0,0,0,0,0,0,3,2,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0
6,0,0,0,1,0,0,0,3,1,0,0,0,0,0,0,4,0,1,1,0,1,0,0,0,0,0,0,0,0,0,1,5,0,0,0,1,0,0,0,0,...,2,0,0,0,3,0,2,0,0,0,0,0,1,0,0,0,0,0,0,0,1,2,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,0,1
7,0,0,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
8,0,0,0,0,0,0,0,0,3,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,2,1,0,0,0,0,1,0,0,...,0,0,2,0,0,0,1,0,2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,4,0,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0
9,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,1,0,0,0,1,0,0,0,2,0,0,0,1,1,0,1,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,1


In [13]:
vocabulary_ = {word : position for position, word in enumerate(vocabulary)}
vocabulary_
# vocabulary word and position in the vocabulary

{'!': 0,
 '!!': 1,
 '$': 2,
 "'": 3,
 '(': 4,
 ')': 5,
 '),': 6,
 ',': 7,
 '.': 8,
 '.,': 9,
 '..': 10,
 '...': 11,
 '0': 12,
 '00': 13,
 ':': 14,
 'a': 15,
 'able': 16,
 'about': 17,
 'act': 18,
 'addition': 19,
 'after': 20,
 'again': 21,
 'ago': 22,
 'all': 23,
 'allowing': 24,
 'also': 25,
 'always': 26,
 'am': 27,
 'amazon': 28,
 'among': 29,
 'an': 30,
 'and': 31,
 'another': 32,
 'anyone': 33,
 'are': 34,
 'area': 35,
 'as': 36,
 'at': 37,
 'autokey': 38,
 'away': 39,
 'awesome': 40,
 'bank': 41,
 'be': 42,
 'because': 43,
 'been': 44,
 'being': 45,
 'believe': 46,
 'both': 47,
 'brilliance': 48,
 'but': 49,
 'buy': 50,
 'buying': 51,
 'by': 52,
 'called': 53,
 'car': 54,
 'cars': 55,
 'change': 56,
 'cheaper': 57,
 'cheapest': 58,
 'cj': 59,
 'come': 60,
 'companies': 61,
 'company': 62,
 'confirm': 63,
 'confirmed': 64,
 'continually': 65,
 'could': 66,
 'couple': 67,
 'customer': 68,
 'customers': 69,
 'daily': 70,
 'decision': 71,
 'declined': 72,
 'def': 73,
 'delayed': 74,

The result is a *sparse count matrix*:

In [14]:
# indexed representation
import numpy as np
# print(X1)

# dense representation
print(X1.todense())

[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 2 2 1 1 2 1 1 2 1 4 1 1 1 3
  1 1 1 2]]


We can access the mapping from vector position to feature names via `get_feature_names()` (in scikit-learn 1.0 and later, the equivalent method is `get_feature_names_out()`):

In [15]:
print(small_vectorizer.get_feature_names())
# get_feature_names() returns the column names in sorted order

['always', 'among', 'and', 'at', 'been', 'car', 'cars', 'change', 'cheaper', 'cheapest', 'continually', 'daily', 'different', 'don', 'elsewhere', 'found', 'has', 'have', 'however', 'if', 'lot', 'many', 'of', 'price', 'prices', 'really', 'research', 'reserve', 'site', 'sites', 'ten', 'the', 'this', 'three', 'time', 'to', 'top', 'use', 'want', 'you']


The inverse (the mapping from feature names to vector positions) is stored as a dictionary in `vocabulary_`:

In [16]:
print(small_vectorizer.vocabulary_)
# vocabulary_ maps each feature name to the index of its column

{'prices': 24, 'change': 7, 'daily': 11, 'and': 2, 'if': 19, 'you': 39, 'want': 38, 'to': 35, 'really': 25, 'research': 26, 'the': 31, 'price': 23, 'continually': 10, 'at': 3, 'many': 21, 'different': 12, 'sites': 29, 'have': 17, 'found': 15, 'cheaper': 8, 'cars': 6, 'elsewhere': 14, 'however': 18, 'don': 13, 'lot': 20, 'of': 22, 'time': 34, 'this': 32, 'site': 28, 'has': 16, 'always': 0, 'been': 4, 'among': 1, 'top': 36, 'three': 33, 'cheapest': 9, 'ten': 30, 'use': 37, 'reserve': 27, 'car': 5}


## Terminology 

![](matrix.pdf)

Let's redo this for the entire corpus:

In [17]:
vectorizer = CountVectorizer(analyzer='word', 
                             ngram_range=(1, 2), 
                             min_df=0.001, # minimum document frequency, as a fraction of all documents
                             max_df=0.75, 
                             stop_words='english')

# check the stop-word list first: for sentiment analysis it may remove
# words that carry the signal -> use a customized stop-word list instead,
# for example one built with the nltk package

X = vectorizer.fit_transform(documents[:10000])

# fit_transform: learn the vocabulary (fit) and compute the counts (transform)
# from here on, do not call fit or fit_transform again on new data; call only transform

print("shape: ", X.shape)

shape:  (10000, 3869)



```
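# (1) refitting on the second batch builds a second, different vocabulary
x1 = vectorizer.fit_transform(documents[:10000])
x2 = vectorizer.fit_transform(documents[10000:20000])

# (2) fitting once and reusing that vocabulary for the second batch
x1 = vectorizer.fit_transform(documents[:10000])
x2 = vectorizer.transform(documents[10000:20000])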
```

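What is the difference?

In the first case, `x1` and `x2` are built from two different vocabularies, so their columns do not line up and the two matrices are not comparable (this is wrong).
In the second case, both matrices share the vocabulary learned from the first batch, so they are comparable (this is correct).

In other words, in the first case we fit on one data set and then fit again on the other, which amounts to training in one feature space and testing in a different one.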

Calling `transform()` on a new document will apply the vocabulary we collected previously to this new data point. Any words we have not seen before are ignored.


In [18]:
vectorizer.transform([documents[-1]])

<1x3869 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [19]:
documents[-1]

'Never had any issues , easy to use and great prices .'
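
The fit/transform distinction can be sketched in plain Python. This is a simplified stand-in, not what `CountVectorizer` does internally: the helper names `fit_vocabulary` and `transform`, the whitespace tokenization, and the toy documents are all assumptions made for illustration.

```python
def fit_vocabulary(tokenized_docs):
    """Map each word type seen in the training documents to a column index."""
    word_types = sorted({word for doc in tokenized_docs for word in doc})
    return {word: idx for idx, word in enumerate(word_types)}

def transform(tokenized_docs, vocabulary):
    """Count words using a fixed vocabulary; unseen words are skipped."""
    rows = []
    for doc in tokenized_docs:
        row = [0] * len(vocabulary)
        for word in doc:
            if word in vocabulary:  # words outside the vocabulary are ignored
                row[vocabulary[word]] += 1
        rows.append(row)
    return rows

# hypothetical toy data
train_docs = ["great prices great service".split(), "slow delivery".split()]
new_docs = ["great delivery , terrible packaging".split()]

vocabulary = fit_vocabulary(train_docs)           # the "fit" step
train_matrix = transform(train_docs, vocabulary)  # the "transform" step
new_matrix = transform(new_docs, vocabulary)      # same columns as train_matrix

print(vocabulary)
print(new_matrix)
```

Because `new_matrix` reuses the training vocabulary, its columns mean the same thing as those of `train_matrix`; the words `,`, `terrible` and `packaging` were never seen during fitting and simply leave no trace.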

## Exercise

Use vector operations to find out 
- what the 5 most frequent words are in `X`
- in how many different documents the word `delivery` occurs
- what percentage of the overall corpus that number corresponds to

In [20]:
# --- my solution ---

import pandas as pd
import numpy as np
from collections import Counter

X_df = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names())
X_count_dict = {X_df.columns[idx] : X_df.iloc[:, idx].sum() for idx in range(X.shape[1])}
print("5 most frequent words: ", Counter(X_count_dict).most_common(5))
# NOTE: X_count_dict holds the total count per feature, so the two lines below report
# the total number of occurrences of 'delivery'; the number of documents that contain it
# (the non-zero entries in that column) can be smaller.
print("number of docs with the word 'delivery' = ", X_count_dict['delivery'])
print("percentage of 'delivery' on the corpus = ", X_count_dict['delivery'] / X.shape[0])

5 most frequent words:  [('00', 2325), ('great', 2268), ('service', 2123), ('time', 2069), ('order', 2056)]
number of docs with the word 'delivery' =  902
percentage of 'delivery' on the corpus =  0.0902


In [21]:
# --- professor solution ---

import numpy as np
print(vectorizer.get_feature_names()) # feature names (i.e. the columns of X)
print(vectorizer.get_feature_names()[3008]) # feature name at index 3008
print(X) # (row, column) pairs with the count of that feature in that document

counts = np.asarray(X.sum(axis=0))[0] # total count per feature
count_ids = counts.argsort()[::-1] # feature indices, from the most to the least frequent
feature_names = vectorizer.get_feature_names()
for idx in count_ids[:5]:
    print(feature_names[idx])

# fraction of documents that contain the word 'delivery'
np.count_nonzero(X[:, vectorizer.vocabulary_['delivery']].toarray()) / X.shape[0]

total_counts = X.sum(axis=0) # not named `sum`, which would shadow the built-in function

sent wrong
  (0, 2510)	1
  (0, 520)	1
  (0, 795)	1
  (0, 3712)	1
  (0, 2687)	1
  (0, 2838)	2
  (0, 2473)	2
  (0, 917)	1
  (0, 3156)	2
  (0, 536)	1
  (0, 487)	1
  (0, 963)	1
  (0, 1933)	1
  (0, 3439)	1
  (0, 3142)	1
  (0, 541)	1
  (0, 3606)	1
  (0, 472)	1
  (0, 919)	1
  (0, 1935)	1
  (0, 2503)	1
  (1, 1166)	1
  (1, 2008)	1
  (1, 631)	1
  (1, 259)	1
  :	:
  (9999, 3857)	1
  (9999, 78)	1
  (9999, 839)	1
  (9999, 3404)	1
  (9999, 3625)	1
  (9999, 850)	1
  (9999, 87)	2
  (9999, 1320)	1
  (9999, 564)	1
  (9999, 1271)	2
  (9999, 1161)	1
  (9999, 2359)	1
  (9999, 1950)	1
  (9999, 1848)	1
  (9999, 906)	1
  (9999, 1610)	1
  (9999, 1162)	1
  (9999, 1515)	1
  (9999, 1809)	2
  (9999, 2899)	1
  (9999, 2070)	1
  (9999, 625)	1
  (9999, 3406)	1
  (9999, 553)	1
  (9999, 395)	2
00
great
service
time
order
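
The exercise asks in how many documents a word occurs, which is not the same as how often it occurs in total: summing a column also adds up repeated mentions inside a single document. A small sketch with a made-up count matrix (plain lists instead of a sparse matrix, purely for illustration) shows the difference:

```python
# hypothetical toy count matrix: rows are documents, columns are features
feature_names = ["delivery", "great", "order"]
counts = [
    [2, 1, 0],
    [0, 1, 1],
    [3, 0, 0],
    [0, 0, 1],
]

column = feature_names.index("delivery")

# total number of times the word appears in the corpus
total_occurrences = sum(row[column] for row in counts)

# number of documents in which the word appears at least once
document_frequency = sum(1 for row in counts if row[column] > 0)

# share of the corpus that contains the word
share_of_corpus = document_frequency / len(counts)

print(total_occurrences, document_frequency, share_of_corpus)
```

Here the word appears five times in total but in only two of the four documents, so the share of the corpus is 0.5.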


## Character $n$-grams

We can also use characters to analyze text:

In [22]:
char_vectorizer = CountVectorizer(analyzer='char', 
                                  ngram_range=(2, 6), 
                                  min_df=0.001, 
                                  max_df=0.75)

C = char_vectorizer.fit_transform(documents[:10])
C

<10x8054 sparse matrix of type '<class 'numpy.int64'>'
	with 10806 stored elements in Compressed Sparse Row format>

In [23]:
print(char_vectorizer.vocabulary_)

{'pr': 5953, 'ic': 4050, 'ce': 2121, 'ch': 2155, 'ng': 5194, 'ge': 3612, ' d': 382, 'da': 2407, 'ai': 1609, 'il': 4153, 'ly': 4732, 'if': 4118, 'f ': 3378, ' y': 1264, 'yo': 8014, 'ou': 5723, 'u ': 7367, 'wa': 7716, 'nt': 5298, 'ea': 2786, 'ar': 1819, 'rc': 6141, 'ti': 7183, 'nu': 5338, 'ua': 7387, 'at': 1904, 'ny': 5348, 'di': 2471, 'ff': 3452, 'fe': 3434, 'en': 3045, 'si': 6696, 'it': 4361, 'te': 7022, ' ,': 51, ', ': 1337, 'i ': 3975, 'av': 1959, 'fo': 3495, 'un': 7438, 'ap': 1805, 'pe': 5880, 'ca': 2106, 'rs': 6380, ' e': 430, 'el': 2985, 'ls': 4709, 'ew': 3324, 'wh': 7773, '. ': 1395, 'ho': 3918, 'ow': 5794, 'we': 7736, 'ev': 3306, 'do': 2505, " '": 17, "' ": 1295, 'a ': 1501, ' l': 690, 'lo': 4679, 'ot': 5699, ' o': 785, 'of': 5474, 'im': 4177, 'hi': 3896, 'as': 1867, 'lw': 4727, 'ay': 1987, 'ys': 8040, ' b': 296, 'be': 2021, 'ee': 2919, 'am': 1680, 'mo': 4891, 'g ': 3554, 'op': 5653, 'p ': 5817, 'hr': 3943, ' (': 34, '( ': 1318, ' g': 524, '.,': 1449, ' )': 42, ') ': 1327, ' u':
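
As a rough picture of what ends up in this vocabulary, character n-grams can be extracted with a sliding window. The helper `char_ngrams` below is a hypothetical plain-Python sketch; it skips the lower-casing, whitespace handling and frequency filtering that the vectorizer applies.

```python
def char_ngrams(text, min_n=2, max_n=3):
    """Return all character n-grams of text with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        # slide a window of width n over the string
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams

grams = char_ngrams("price", 2, 3)
print(grams)
```

For `"price"` this yields the bigrams `pr`, `ri`, `ic`, `ce` followed by the trigrams `pri`, `ric`, `ice`. With `ngram_range=(2, 6)` the same idea simply continues up to windows of six characters.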

## Syntactic $n$-grams

In [24]:
import spacy
nlp = spacy.load('en')

# Here the pre-processing happens before the vectorizer.
# To build more complex n-grams (for example, ones that capture
# syntactic relations), we have to work at the pre-processing level,
# as we do here: each token is joined with the lemma of its syntactic head.

features = [' '.join(["{}_{}".format(token.lemma_, token.head.lemma_) 
                      for token in nlp(sentence)])
            for sentence in documents[:100]]

syntax_vectorizer = CountVectorizer()
X = syntax_vectorizer.fit_transform(features)

In [25]:
print(documents[0])
print(features[0])

Prices change daily and if you want to really research the price continually at many different sites , I have found cheaper cars elsewhere . However , if you don ' t have a lot of time to research the price , this site has always been among the top three ( e . g ., cheapest ) of the ten sites I use to reserve a car .
price_change change_change daily_change and_change if_want -PRON-_want want_find to_research really_research research_want the_price price_research continually_research at_research many_site different_site site_at ,_find -PRON-_find have_find find_change cheap_car car_find elsewhere_find ._find however_be ,_be if_have -PRON-_have don_-PRON- '_t t_have have_be a_lot lot_have of_lot time_of to_research research_lot the_price price_research ,_be this_site site_be have_be always_be be_be among_be the_three top_three three_among (_. e_. ._three g_. ._three ,_cheap cheap_cheap )_cheap of_cheap the_site ten_site site_of -PRON-_use use_site to_reserve reserve_use a_car car_reserve

In [26]:
print(syntax_vectorizer.vocabulary_)

{'price_change': 3008, 'change_change': 1257, 'daily_change': 1400, 'and_change': 725, 'if_want': 2194, 'pron': 3072, '_want': 420, 'want_find': 4199, 'to_research': 3975, 'really_research': 3125, 'research_want': 3186, 'the_price': 3769, 'price_research': 3019, 'continually_research': 1357, 'at_research': 932, 'many_site': 2474, 'different_site': 1485, 'site_at': 3410, '_find': 197, 'have_find': 2073, 'find_change': 1752, 'cheap_car': 1268, 'car_find': 1235, 'elsewhere_find': 1589, 'however_be': 2150, '_be': 97, 'if_have': 2190, '_have': 222, 'don_': 1539, '_t': 380, 't_have': 3583, 'have_be': 2064, 'a_lot': 481, 'lot_have': 2424, 'of_lot': 2702, 'time_of': 3894, 'research_lot': 3185, 'this_site': 3859, 'site_be': 3411, 'always_be': 670, 'be_be': 988, 'among_be': 687, 'the_three': 3812, 'top_three': 4025, 'three_among': 3871, 'e_': 1558, '_three': 390, 'g_': 1922, '_cheap': 121, 'cheap_cheap': 1269, 'of_cheap': 2686, 'the_site': 3795, 'ten_site': 3621, 'site_of': 3414, '_use': 406, 'u

# Dense Distributed Representations

## Word embeddings with `Word2vec`

In [27]:
from gensim.models import Word2Vec
from gensim.models.word2vec import FAST_VERSION

# list of lists of tokens
corpus = [document.split() for document in documents]

# initialize model
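w2v_model = Word2Vec(size=100, 
                     window=15,
                     sample=0.0001,
                     iter=200,
                     negative=5, 
                     min_count=100,
                     workers=-1, 
                     hs=0
)

# size = dimensionality of the word vectors (300 is common, but takes longer to train)
# window = number of context words considered before and after the target word
# (usually 5-10)
# iter = number of passes (epochs) over the corpus
# (in gensim 4 and later, size is called vector_size and iter is called epochs)
# workers = number of training threads; gensim expects a positive integer here.
# With workers=-1 no worker thread appears to be started, so the vectors may stay
# at their small random initial values, which would explain the low similarities below.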


w2v_model.build_vocab(corpus) # create the vocabulary over the corpus

w2v_model.train(corpus, 
                total_examples=w2v_model.corpus_count, 
                epochs=w2v_model.epochs)

w2v_model.wv.index2entity

['.',
 'the',
 'and',
 'I',
 'to',
 ',',
 'a',
 'was',
 'of',
 'for',
 'my',
 "'",
 'it',
 'in',
 'with',
 'have',
 'on',
 'that',
 'is',
 'they',
 'you',
 'had',
 'from',
 'as',
 '0',
 'me',
 '-',
 'this',
 'but',
 'be',
 'service',
 'not',
 'very',
 'them',
 '!',
 't',
 'at',
 'are',
 'The',
 '00',
 'would',
 'so',
 'all',
 'i',
 'time',
 'delivery',
 'were',
 'order',
 'will',
 'an',
 'been',
 'good',
 'price',
 'great',
 'when',
 'no',
 'their',
 'get',
 'use',
 '(',
 'day',
 'company',
 'out',
 'can',
 'up',
 'we',
 's',
 'which',
 'one',
 'again',
 'by',
 'what',
 'ordered',
 'just',
 'or',
 'about',
 'than',
 'easy',
 'only',
 'if',
 'your',
 'prices',
 'site',
 'customer',
 'more',
 'arrived',
 'always',
 'there',
 'website',
 'days',
 'received',
 'now',
 'do',
 'back',
 'has',
 'got',
 'phone',
 'after',
 'They',
 'any',
 'recommend',
 'could',
 'other',
 'well',
 'find',
 ')',
 '"',
 'really',
 'am',
 'best',
 've',
 'first',
 'next',
 'used',
 'even',
 'like',
 'found',
 's

In [28]:
len(w2v_model.wv.vocab.keys())

3376

In [None]:
corpus

Now we can use the embeddings of the model:

In [30]:
w2v_model.wv['delivery']
# this is the embedding of 'delivery': a dense vector with one value per dimension (100 here)

array([-0.00201604,  0.00193741,  0.00382725, -0.00033414,  0.00362758,
        0.00432345, -0.00182475, -0.00233103,  0.00321392,  0.00309775,
       -0.00059338, -0.00368568,  0.00202271, -0.00476235,  0.00403282,
        0.00371545, -0.00408697, -0.00046957, -0.00340057, -0.0036016 ,
       -0.00249131, -0.00062918,  0.00127284, -0.0026412 , -0.00014614,
        0.00360143,  0.00221511, -0.00028227,  0.00269918,  0.00386152,
        0.00222016,  0.00140975, -0.00482677,  0.00025525,  0.00406637,
        0.00043819,  0.00382985,  0.0043268 , -0.00496591, -0.0021413 ,
        0.00411662, -0.00193802, -0.00315692, -0.00366614, -0.00049131,
        0.00023933, -0.00171484,  0.00476129, -0.00170774, -0.00210903,
       -0.00230342, -0.00048112, -0.00051851,  0.00172467, -0.00470639,
        0.00385257, -0.00196409, -0.00335211, -0.00169611, -0.00211254,
        0.00378636,  0.00308515, -0.00180184, -0.00351099,  0.00099217,
       -0.00217315,  0.00033141, -0.00455476,  0.00434059,  0.00

In [31]:
w2v_model.wv.most_similar(['delivery'])
# most similar words to 'delivery', with their cosine similarity
# higher is more similar (the maximum is 1)
# note that in practice the top scores are often not very high,
# around 0.5

[('Go', 0.3252245783805847),
 ('proper', 0.3045898675918579),
 ('proud', 0.30452364683151245),
 ('credit', 0.284646600484848),
 ('c', 0.2783959209918976),
 ('economy', 0.2765747606754303),
 ('cardboard', 0.2721385359764099),
 ('locate', 0.26323026418685913),
 ('anytime', 0.26179271936416626),
 ('but', 0.2611130475997925)]
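
The scores returned by `most_similar` are cosine similarities. A minimal plain-Python sketch of that measure, using made-up two-dimensional vectors rather than the model's embeddings, looks like this:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical toy vectors
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1

print(same_direction, orthogonal, opposite)
```

Only the direction of the vectors matters, not their length, which is why the measure is bounded by -1 and 1.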

In [32]:
w2v_model.wv.most_similar(['delivery','concert'])
# words most similar to 'delivery' and 'concert' taken together
# think of the words as points in the vector space: the query is a point
# "in the middle" between the words we passed in,
# and the result is a nearest-neighbour search around that point

[('fell', 0.4156554341316223),
 ('working', 0.34687885642051697),
 ('long', 0.31385862827301025),
 ('promotions', 0.2944999933242798),
 ('anytime', 0.2898682951927185),
 ('new', 0.2894074618816376),
 ('acceptable', 0.2862030863761902),
 ('proud', 0.2835082411766052),
 ('0', 0.27840912342071533),
 ('complaining', 0.2779654860496521)]

In [33]:
# birthday - present + husband => birthday:present as husband:?
w2v_model.wv.most_similar(positive=['birthday', 'husband'], negative=['present'], topn=3)

[('Why', 0.35734623670578003),
 ('loss', 0.34225404262542725),
 ('guarantee', 0.3208310902118683)]
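
The analogy query boils down to vector arithmetic followed by a nearest-neighbour search. The sketch below illustrates the idea on hand-made two-dimensional vectors; the words, the vectors and the helper `analogy` are hypothetical, and unlike gensim it does not normalize the vectors before combining them.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, positive, negative, topn=1):
    """Add the positive vectors, subtract the negative ones, rank the rest."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # the query words themselves are excluded from the candidates
    excluded = set(positive) | set(negative)
    scored = [(word, cosine_similarity(query, vec))
              for word, vec in vectors.items() if word not in excluded]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# hypothetical toy embeddings: first axis ~ "royalty", second axis ~ "gender"
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -2.0],
}

# king - man + woman => man:king as woman:?
result = analogy(toy_vectors, positive=["king", "woman"], negative=["man"])
print(result)
```

On a model that has actually learned meaningful vectors, the same arithmetic is what lets `most_similar(positive=..., negative=...)` answer analogy questions.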

In [34]:
word1 = "Cheapest"
word2 = "friendly"

# retrieve the actual vector (we will never use it)
# print(w2v_model.wv[word1])

# compare by computing similarity between two words
print(w2v_model.wv.similarity(word1, word2))

# get the 3 most similar words
print(w2v_model.wv.most_similar(word1, topn=3))


0.042392984
[('proper', 0.3280349671840668), ('verified', 0.32115039229393005), ('issues', 0.3185743987560272)]



### Exercise
Use `spacy` to restrict the words in the reviews to *content words*, i.e., nouns, verbs, and adjectives. Transform the words to lower case and add the POS with an underscore. E.g.:

`love_VERB old-fashioneds_NOUN`

This also allows us to distinguish between homographs, i.e., words that are written the same, but belong to different word classes, e.g., *love* in "I **love** old-fashioneds" vs. "He felt so sick, it must have been **love**".


Make sure to exclude sentences that contain none of the above.

Write the resulting corpus to a variable called `word_corpus`.

In [None]:
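# Your code here
import spacy
nlp = spacy.load('en')  # in spaCy 3 and later: spacy.load('en_core_web_sm')

content_pos = {'NOUN', 'VERB', 'ADJ'}

# Word2Vec expects a list of token lists, so we keep one list per document
word_corpus = []
for document in documents[:10]:
    tokens = ['_'.join([token.lower_, token.pos_])
              for token in nlp(document)
              if token.pos_ in content_pos]
    if tokens:  # skip documents that contain no content words
        word_corpus.append(tokens)
word_corpus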

Rerun the `Word2vec` model from above on the new data set and test the words out.

In [62]:
# Your code here
w2v_model = {}

# train four models with identical settings, stored under the keys 0..3
for i in range(0, 4):
    w2v_model[i] = Word2Vec(size=100,
                            window=15,
                            sample=0.0001,
                            iter=200,
                            negative=5,
                            min_count=1,
                            workers=-1,
                            hs=0)

    w2v_model[i].build_vocab(word_corpus)

    w2v_model[i].train(word_corpus,
                       total_examples=w2v_model[i].corpus_count,
                       epochs=w2v_model[i].epochs)

## Exercise

Train 4 more `Word2vec` models and average the resulting embedding matrices.

In [67]:
div_el = lambda i: i / 4 # divide each matrix element by a fixed value (the number of models)
vectorized_div = np.vectorize(div_el) # make a vectorized division

In [None]:
# Your code here
# average the four models' vectors word by word
average_embedding = {}
for feature in w2v_model[0].wv.index2entity:  # for each word in the vocabulary
    summed = (w2v_model[0].wv[feature] + w2v_model[1].wv[feature]
              + w2v_model[2].wv[feature] + w2v_model[3].wv[feature])
    average_embedding[feature] = vectorized_div(summed)  # divide the sum by 4

average_embedding

## Document embeddings with `Doc2Vec`

In [38]:
df.head()

Unnamed: 0,score,category,uid,gender,age,text
0,5,Car Rental,899881,F,50,Prices change daily and if you want to really ...
1,5,Fitness & Nutrition,828184,M,32,and the fact that they will match other compan...
2,5,Electronic Payment,1698375,M,48,Used Paypal for my buying and selling for the ...
3,5,Gaming,3324079,M,29,I ' ve made two purchases on CJ ' s for Fallou...
4,4,Jewelry,719816,F,29,I was very happy with the diamond that I order...


In [39]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import FAST_VERSION
from gensim.models.doc2vec import TaggedDocument

corpus = []

# representation not only at the word level but also at the category level

# in this case Doc2Vec takes as input not
# only the text but also the tags
for row in df.iterrows():
    label = row[1].score # as a tag we take the score
    text = row[1].text
    corpus.append(TaggedDocument(words=text.split(), tags=[str(label)]))
    # if we encode the score as the tag, the model learns a vector for each tag (e.g. 5),
    # and with it we can look for the words most similar to the most positive score

d2v_model = Doc2Vec(vector_size=100, 
                    window=15,
                    hs=0,
                    sample=0.000001,
                    negative=5,
                    min_count=100,
                    workers=-1,
                    epochs=500,
                    dm=0, 
                    dbow_words=1)

d2v_model.build_vocab(corpus)

d2v_model.train(corpus, total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)

done


We can now look at the elements

In [40]:
corpus[6]

TaggedDocument(words=['First', 'off', 'I', 'usually', 'never', 'get', 'extended', 'warranties', 'but', 'when', 'I', 'purchased', 'a', 'Hisense', 'TV', 'from', 'Tiger', 'Direct', 'something', 'told', 'me', 'to', 'get', 'the', 'warranty', 'and', 'would', 'you', 'believe', 'one', 'month', 'after', 'the', 'warranty', 'expired', 'the', 'TV', 'started', 'to', 'act', 'up', 'shuting', 'off', 'and', 'on', ',', 'I', 'called', 'square', 'trade', 'and', 'spoke', 'to', 'a', 'fine', 'gentleman', 'who', 'seemed', 'to', 'know', 'the', 'problem', 'right', 'off', ',', 'he', 'over', 'nighted', 'the', 'parts', 'and', 'set', 'up', 'a', 'TV', 'repair', 'man', 'in', 'my', 'area', 'to', 'come', 'to', 'my', 'house', 'to', 'fix', 'it', ',', 'which', 'he', 'did', 'in', 'about', 'an', 'hour', 'and', 'so', 'far', 'I', "'", 'm', 'a', 'happy', 'man', '.', 'Square', 'Trade', 'Rules'], tags=['5'])

In [41]:
d2v_model.docvecs[3]

array([-0.00023043,  0.00278153, -0.00299326, -0.00052014,  0.0002056 ,
        0.00117497, -0.00026105,  0.0019257 , -0.00236471,  0.00294526,
       -0.00087223, -0.00277281, -0.00056244,  0.00039251,  0.00155478,
       -0.00408785, -0.0011266 , -0.0047998 ,  0.00302741,  0.00398723,
        0.00434473, -0.00475034, -0.00458683, -0.0024246 , -0.00464964,
       -0.00116596,  0.00269334,  0.00148525,  0.00048501,  0.00284068,
        0.00099539, -0.00182732,  0.00248564, -0.004758  ,  0.00187041,
        0.0037552 ,  0.00362319,  0.00476838,  0.00337846,  0.00476003,
        0.00397788, -0.00132761,  0.00054146,  0.00245838, -0.00229237,
       -0.00492547,  0.0010991 , -0.00435969, -0.00058782, -0.00277903,
       -0.00105006, -0.00333356,  0.00396559, -0.00418733, -0.0029618 ,
       -0.00165325,  0.00410667,  0.00139435, -0.00157359,  0.0018224 ,
        0.00469214, -0.00396327, -0.0016568 ,  0.00484395,  0.00028675,
       -0.00400514,  0.00053792, -0.00035401, -0.0045846 ,  0.00

In [42]:
d2v_model.docvecs.doctags

{'1': Doctag(offset=2, word_count=1205430, doc_count=7316),
 '2': Doctag(offset=3, word_count=301478, doc_count=2197),
 '3': Doctag(offset=4, word_count=254820, doc_count=2496),
 '4': Doctag(offset=1, word_count=604853, doc_count=9164),
 '5': Doctag(offset=0, word_count=4205492, doc_count=78827)}

In [43]:
target_doc = '1'

similar_docs = d2v_model.docvecs.most_similar(target_doc, topn=5)
print(similar_docs)
# we compute the similarity with the other tags

[('5', 0.12073013186454773), ('3', 0.10586314648389816), ('4', -0.029272064566612244), ('2', -0.15278185904026031)]


## Exercise

What are the 10 most similar ***words*** to each category?

In [44]:
d2v_model.docvecs.index2entity

['5', '4', '1', '2', '3']

In [45]:
d2v_model.wv.index2entity
len(d2v_model.wv.index2entity)

3376

In [46]:
len(d2v_model.wv.vocab.keys())

3376

In [47]:
{tag : [word_sim[0]
        for word_sim in d2v_model.wv.similar_by_vector(d2v_model.docvecs[tag], topn = 10)] 
  for tag in d2v_model.docvecs.index2entity}

{'1': ['gift',
  '=',
  'themselves',
  'Chat',
  'about',
  'isn',
  'printer',
  'sorting',
  'updated',
  'navigate'],
 '2': ['date',
  'successfully',
  'everytime',
  'offered',
  'superior',
  'router',
  'automated',
  'people',
  'saves',
  'clean'],
 '3': ['messed',
  'aware',
  'considering',
  'respond',
  'management',
  'pump',
  'countless',
  'F',
  'incredibly',
  'getting'],
 '4': ['parts',
  'during',
  'Warehouse',
  'trips',
  'difference',
  'confirming',
  '0st',
  'PC',
  'risk',
  'IT'],
 '5': ['massive',
  'others',
  'freezer',
  'Ebay',
  'pm',
  'ie',
  'slots',
  '00st',
  'mark',
  'little']}