<br><br><br>

## SCI 498 - 410: Online Social Network Analysis
### Lecture 7, pt 2
<br><br>
### Aron Culotta
### Illinois Institute of Technology  


<br><br><br>

## Word representations

Up to now, we've represented words by simply mapping them to a unique column index. 

E.g., the document "happy" becomes

$$[0,0,1,0,\ldots,0]$$

where "happy" is assigned index 2.

<br><br>
However, this representation ignores any similarity between related words.

E.g., "glad" may be

$$[0,0,0,0,0,0,1,\ldots,0]$$

where "glad" is assigned index 6.

There is no way for us to tell that "happy" and "glad" are similar terms using this word representation.
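To make this concrete, here is a minimal sketch in plain Python. The eight-word vocabulary is made up for illustration; only the index assignments for "happy" (2) and "glad" (6) follow the text above.

```python
# Hypothetical toy vocabulary; "happy" sits at index 2 and "glad" at index 6.
vocab = ['a', 'the', 'happy', 'is', 'of', 'to', 'glad', 'sad']

def one_hot(word, vocab):
    """Return a list with a 1 at the word's index and 0 everywhere else."""
    return [1 if w == word else 0 for w in vocab]

def dot(x, y):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(x, y))

happy = one_hot('happy', vocab)
glad = one_hot('glad', vocab)
sad = one_hot('sad', vocab)

print(happy)              # [0, 0, 1, 0, 0, 0, 0, 0]
print(dot(happy, glad))   # 0
print(dot(happy, sad))    # 0
```

Every pair of distinct words has a dot product of 0, so "happy" looks exactly as unrelated to "glad" as it does to "sad".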

<br><br>

<u> Why does this matter? </u>

- **Reason 1: Statistical efficiency**

Recall our logistic regression model, which has a separate coefficient $\theta_j$ for each term $w_j$.

We would expect the coefficient $\theta_{\mathrm{glad}}$ to be similar to the coefficient $\theta_{\mathrm{happy}}$.

Or, perhaps we could collapse these coefficients into a single $\theta_{\mathrm{positive\_emotion}}$.

Recall that the quality of our estimates for each $\theta_j$ depends in part on the number of training examples containing term $w_j$. By collapsing terms, or enforcing that similar terms have similar coefficients, we can make more efficient use of the limited training data we have.
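A rough sketch of the effect of collapsing terms. The four training documents and the hand-built `groups` mapping are hypothetical; the point is only that a shared feature is observed in more training examples than either word alone.

```python
from collections import Counter

# Hypothetical training documents.
train_docs = ['so happy today', 'happy to help', 'glad you came', 'very sad news']

# Hypothetical hand-built mapping from terms to a shared feature.
groups = {'happy': 'positive_emotion', 'glad': 'positive_emotion'}

def feature_counts(docs, groups=None):
    """Count how many documents contain each feature (after optional collapsing)."""
    groups = groups or {}
    counts = Counter()
    for doc in docs:
        for tok in set(doc.split()):
            counts[groups.get(tok, tok)] += 1
    return counts

before = feature_counts(train_docs)
after = feature_counts(train_docs, groups)
print(before['happy'], before['glad'])   # 2 1
print(after['positive_emotion'])         # 3
```

With separate coefficients, $\theta_{\mathrm{glad}}$ would be estimated from a single document; the collapsed coefficient is estimated from three.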

<br><br>

- **Reason 2: Out-of-vocabulary words**

Given our limited training data, there are many words which may appear in the testing data but not the training data.

E.g., "elated" has no corresponding $\theta$ coefficient if it does not appear in the training data.

This is even more important for informal text (SMS, social media), where abbreviations, emoticons, and neologisms abound.
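A small sketch of what happens at test time. The coefficient values in `theta` are invented for illustration; the behavior to notice is that an unseen word contributes nothing to the score.

```python
# Hypothetical coefficients learned from training data that never contained "elated".
theta = {'happy': 1.2, 'glad': 1.1, 'sad': -1.3}

def score(tokens, theta):
    """Sum the coefficients of known tokens; also report the unknown ones."""
    known = [t for t in tokens if t in theta]
    unknown = [t for t in tokens if t not in theta]
    return sum((theta[t] for t in known), 0.0), unknown

total, unknown = score('so elated'.split(), theta)
print(total)     # 0.0
print(unknown)   # ['so', 'elated']
```

Although "elated" is strongly positive, the model has no coefficient for it, so the document scores the same as an empty one.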

## Language models

One way to represent words is to summarize the contexts in which they appear.

<u>Assumption</u>: words that appear in similar contexts have similar semantics or syntactic functions.

E.g., what are the most probable words $p(w_i \mid \mathrm{"I\: feel\: so"})$?
 - happy
 - glad
 - **sad**

 
<br><br>
How do we represent "similar contexts"?

An n-gram language model: 

$$p(w_i \mid w_{i-1} \ldots w_{i-n})$$
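One simple way to estimate this probability is by counting. The sketch below is a count-based (maximum-likelihood) estimate on a made-up four-sentence corpus; the function names `train_ngram_lm` and `prob` are just for this illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus.
corpus = [
    'I feel so happy',
    'I feel so glad',
    'I feel so sad',
    'we feel so happy',
]

def train_ngram_lm(docs, n):
    """Count how often each word follows each context of n preceding words."""
    counts = defaultdict(Counter)
    for doc in docs:
        tokens = doc.split()
        for i in range(n, len(tokens)):
            context = tuple(tokens[i - n:i])
            counts[context][tokens[i]] += 1
    return counts

def prob(word, context, counts):
    """Maximum-likelihood estimate of p(word | context); 0.0 for an unseen context."""
    following = counts.get(tuple(context), Counter())
    total = sum(following.values())
    return following[word] / total if total else 0.0

counts = train_ngram_lm(corpus, n=2)
print(prob('happy', ['feel', 'so'], counts))   # 0.5
print(prob('glad', ['feel', 'so'], counts))    # 0.25
print(prob('sad', ['feel', 'so'], counts))     # 0.25
```

Note that "happy", "glad", and "sad" all share the context "feel so", even though "sad" has the opposite meaning.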




<u> Idea:</u> represent each word as a vector of values
- Words with similar vectors should be similar

## Language models as classification

$$p(w_i \mid w_{i-1} \ldots w_{i-n})$$

Predict word $w_i$, using the prior $n$ words as "features".


Any classifier can be used (Naive Bayes, logistic regression, neural nets, ...)
- class labels: all possible words in the vocabulary
- features: the words that appear around word $w_i$

In [1]:
docs = ['I am Sam',
        'You are Sam',
        'Sam I am',
        'I do not like green eggs and ham',
        'Sam I was',
        'I am Dan',
       ]

In [2]:
from collections import Counter

def iter_ngrams(doc, n):
    """Return a generator over ngrams of a document.
    Params:
      doc...list of tokens
      n.....size of ngrams"""
    return (doc[i : i+n] for i in range(len(doc)-n+1))

def iterate_examples(docs, n):
    for doc in docs:
        for ngram in iter_ngrams(doc.split(), n): 
            yield ngram[:-1], ngram[-1]
            
            
[x for x in iterate_examples(docs, 3)]

[(['I', 'am'], 'Sam'),
 (['You', 'are'], 'Sam'),
 (['Sam', 'I'], 'am'),
 (['I', 'do'], 'not'),
 (['do', 'not'], 'like'),
 (['not', 'like'], 'green'),
 (['like', 'green'], 'eggs'),
 (['green', 'eggs'], 'and'),
 (['eggs', 'and'], 'ham'),
 (['Sam', 'I'], 'was'),
 (['I', 'am'], 'Dan')]

In [3]:
# DictVectorizer: useful for creating sparse matrices from a list of dicts.

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
X = vec.fit_transform([
                        {'a': 10, 'b': 1},
                        {'b': 100, 'c': 1000},
                      ])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print('feature names:\n', list(vec.get_feature_names_out()))
print('feature matrix:\n', X.todense())

feature names:
 ['a', 'b', 'c']
feature matrix:
 [[  10.    1.    0.]
 [   0.  100. 1000.]]


In [4]:
import numpy as np

# Convert preceding terms into a single feature per instance.
# E.g., ["I", "am"] -> "I_am" 
vec = DictVectorizer()
X = vec.fit_transform({'_'.join(x[0]): 1} for x in iterate_examples(docs, 3))
y = np.array([x[1] for x in iterate_examples(docs, 3)])
print('feature names:\n', list(vec.get_feature_names_out()))
print('feature matrix:\n', X.todense())
print('labels:\n', y)

feature names:
 ['I_am', 'I_do', 'Sam_I', 'You_are', 'do_not', 'eggs_and', 'green_eggs', 'like_green', 'not_like']
feature matrix:
 [[1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0.]]
labels:
 ['Sam' 'Sam' 'am' 'not' 'like' 'green' 'eggs' 'and' 'ham' 'was' 'Dan']


In [6]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs', multi_class='auto')
clf.fit(X, y)
print('class labels:\n', clf.classes_)
print('coefficients:\n', clf.coef_)
print('feature names:\n', list(vec.get_feature_names_out()))

class labels:
 ['Dan' 'Sam' 'am' 'and' 'eggs' 'green' 'ham' 'like' 'not' 'was']
coefficients:
 [[ 0.68507353 -0.07781821 -0.14572623 -0.07249646 -0.07781821 -0.07781821
  -0.07781821 -0.07781821 -0.07781821]
 [ 0.4755455  -0.14885424 -0.26547935  0.68304691 -0.14885424 -0.14885424
  -0.14885424 -0.14885424 -0.14885424]
 [-0.13722502 -0.07721724  0.67248075 -0.07193395 -0.07721724 -0.07721724
  -0.07721724 -0.07721724 -0.07721724]
 [-0.14769483 -0.08346053 -0.15562599 -0.07778043 -0.08346053 -0.08346053
   0.7984096  -0.08346053 -0.08346053]
 [-0.14769483 -0.08346053 -0.15562599 -0.07778043 -0.08346053 -0.08346053
  -0.08346053  0.7984096  -0.08346053]
 [-0.14769483 -0.08346053 -0.15562599 -0.07778043 -0.08346053 -0.08346053
  -0.08346053 -0.08346053  0.7984096 ]
 [-0.14769483 -0.08346053 -0.15562599 -0.07778043 -0.08346053  0.7984096
  -0.08346053 -0.08346053 -0.08346053]
 [-0.14769483 -0.08346053 -0.15562599 -0.07778043  0.7984096  -0.08346053
  -0.08346053 -0.08346053 -0.08346053]
 [

In [12]:
clf.intercept_

array([-0.11575445,  0.60393625, -0.12401608, -0.04002494, -0.04002494,
       -0.04002494, -0.04002494, -0.04002494, -0.04002494, -0.12401608])

In [7]:
# prettier as a Pandas DataFrame.
import pandas as pd
df = pd.DataFrame(clf.coef_, columns=vec.get_feature_names_out(), index=clf.classes_)
df

| | I_am | I_do | Sam_I | You_are | do_not | eggs_and | green_eggs | like_green | not_like |
|---|---|---|---|---|---|---|---|---|---|
| Dan | 0.685074 | -0.077818 | -0.145726 | -0.072496 | -0.077818 | -0.077818 | -0.077818 | -0.077818 | -0.077818 |
| Sam | 0.475546 | -0.148854 | -0.265479 | 0.683047 | -0.148854 | -0.148854 | -0.148854 | -0.148854 | -0.148854 |
| am | -0.137225 | -0.077217 | 0.672481 | -0.071934 | -0.077217 | -0.077217 | -0.077217 | -0.077217 | -0.077217 |
| and | -0.147695 | -0.083461 | -0.155626 | -0.07778 | -0.083461 | -0.083461 | 0.79841 | -0.083461 | -0.083461 |
| eggs | -0.147695 | -0.083461 | -0.155626 | -0.07778 | -0.083461 | -0.083461 | -0.083461 | 0.79841 | -0.083461 |
| green | -0.147695 | -0.083461 | -0.155626 | -0.07778 | -0.083461 | -0.083461 | -0.083461 | -0.083461 | 0.79841 |
| ham | -0.147695 | -0.083461 | -0.155626 | -0.07778 | -0.083461 | 0.79841 | -0.083461 | -0.083461 | -0.083461 |
| like | -0.147695 | -0.083461 | -0.155626 | -0.07778 | 0.79841 | -0.083461 | -0.083461 | -0.083461 | -0.083461 |
| not | -0.147695 | 0.79841 | -0.155626 | -0.07778 | -0.083461 | -0.083461 | -0.083461 | -0.083461 | -0.083461 |
| was | -0.137225 | -0.077217 | 0.672481 | -0.071934 | -0.077217 | -0.077217 | -0.077217 | -0.077217 | -0.077217 |


In [11]:
# Can now use the classifier to predict the next word, 
# given the previous words.
clf.predict_proba(vec.transform({'I_am': 1}))
#clf.predict(vec.transform({'I_am': 1}))

array([[0.15744332, 0.26223163, 0.0686146 , 0.07384931, 0.07384931,
        0.07384931, 0.07384931, 0.07384931, 0.07384931, 0.0686146 ]])
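These probabilities can be reproduced by hand from the numbers printed above, assuming the fitted model is the multinomial (softmax) form of logistic regression. The sketch copies the `I_am` column of the coefficient table (at its displayed precision) and the intercepts, adds them, and normalizes.

```python
from math import exp

classes = ['Dan', 'Sam', 'am', 'and', 'eggs', 'green', 'ham', 'like', 'not', 'was']

# I_am column of the coefficient table, rounded as displayed.
coef_I_am = [0.685074, 0.475546, -0.137225, -0.147695, -0.147695,
             -0.147695, -0.147695, -0.147695, -0.147695, -0.137225]

# Intercepts as printed by clf.intercept_.
intercept = [-0.11575445, 0.60393625, -0.12401608, -0.04002494, -0.04002494,
             -0.04002494, -0.04002494, -0.04002494, -0.04002494, -0.12401608]

# With only the I_am feature set to 1, each class score is coefficient + intercept.
scores = [c + b for c, b in zip(coef_I_am, intercept)]

# Softmax: exponentiate and normalize so the scores sum to 1.
exps = [exp(s) for s in scores]
total = sum(exps)
probs = [e / total for e in exps]

for label, p in zip(classes, probs):
    print(f'{label:>5}  {p:.4f}')
```

The result agrees with `predict_proba` to about four decimal places: "Sam" is most probable after "I am", followed by "Dan".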

## Insight

- Each word $w_i$ has a separate $\theta_i$ vector in the classifier.
- A high value $\theta_{ij} \in \theta_i$ means that bigram $j$ is predictive of word $i$
- Perhaps words with similar vectors are also similar?
  - Appear in similar contexts

In [9]:
# Which preceding bigrams are predictive of Dan?
df.loc['Dan'].sort_values(ascending=False)

I_am          0.685074
You_are      -0.072496
not_like     -0.077818
like_green   -0.077818
green_eggs   -0.077818
eggs_and     -0.077818
do_not       -0.077818
I_do         -0.077818
Sam_I        -0.145726
Name: Dan, dtype: float64

In [10]:
# Which preceding bigrams are predictive of Sam?
df.loc['Sam'].sort_values(ascending=False)

You_are       0.683047
I_am          0.475546
not_like     -0.148854
like_green   -0.148854
green_eggs   -0.148854
eggs_and     -0.148854
do_not       -0.148854
I_do         -0.148854
Sam_I        -0.265479
Name: Sam, dtype: float64

**Cosine similarity**

A common way of measuring similarity between vectors:

$$ \cos(x, y) = \frac{\sum_{i} x_i y_i}{\sqrt{\sum_i x_i^2} \sqrt{\sum_i y_i^2}}$$

1 $\rightarrow$ $x$ and $y$ point in the same direction  
0 $\rightarrow$ $x$ and $y$ are orthogonal  
-1 $\rightarrow$ $x$ and $y$ point in opposite directions

![cos](figs/cos.png)
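A quick worked example of the formula in plain Python, using small made-up vectors. Note that cosine similarity depends only on direction, not length.

```python
from math import sqrt

def cosine(x, y):
    """Cosine similarity between two equal-length, non-zero vectors."""
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return numerator / denominator

print(cosine([1, 2], [2, 4]))     # approximately 1: same direction, different length
print(cosine([1, 0], [0, 1]))     # 0.0: orthogonal
print(cosine([1, 2], [-1, -2]))   # approximately -1: opposite direction
```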

In [14]:
# Are words with similar coefficients related?
from math import sqrt

def similarity(word1, word2, clf):
    # find the coefficient vector for each word
    i1 = list(clf.classes_).index(word1)
    i2 = list(clf.classes_).index(word2)
    coef1 = clf.coef_[i1]
    coef2 = clf.coef_[i2]
    # compute cosine similarity
    return np.dot(coef1, coef2) / (sqrt(np.dot(coef1, coef1)) * sqrt(np.dot(coef2, coef2)))
    
similarity('Sam', 'Dan', clf)

0.556728065026134

In [15]:
similarity('Sam', 'am', clf)

-0.330628244369295

We can do the same using a neural network, though now we will also have a hidden layer.

In [16]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=[2],
                    activation='logistic',
                    solver='lbfgs',
                    random_state=1234)

mlp.fit(X, y)
mlp.coefs_

[array([[ -6.23644874,  15.96633581],
        [ -7.06618913,   0.1457164 ],
        [ -0.78687689,  -1.42733742],
        [  0.40336114,  11.04405796],
        [  9.0090014 ,   6.82401002],
        [  0.21178074,  -8.77437404],
        [ -8.07283676, -14.48600641],
        [  7.47747691,  -0.72793465],
        [  6.49598302,  -8.13583693]]),
 array([[-17.56290745,  -5.84015556,  -3.029439  , -18.70573581,
          22.17871973,  16.81885779,   3.00446562,  16.95826254,
         -10.71281826,  -3.06820691],
        [ 39.28977668,  39.87983503,  -4.19210384, -42.36542536,
          -0.80416064, -23.77781269, -37.90565309,  17.64933774,
          17.20565371,  -4.38586329]])]

## Word vectors

If we run the above approach on a very large, unlabeled dataset, we can associate a parameter vector $\theta_i$ with each word $i$.

We can use this vector to represent each word.

In [17]:
df

| | I_am | I_do | Sam_I | You_are | do_not | eggs_and | green_eggs | like_green | not_like |
|---|---|---|---|---|---|---|---|---|---|
| Dan | 0.685074 | -0.077818 | -0.145726 | -0.072496 | -0.077818 | -0.077818 | -0.077818 | -0.077818 | -0.077818 |
| Sam | 0.475546 | -0.148854 | -0.265479 | 0.683047 | -0.148854 | -0.148854 | -0.148854 | -0.148854 | -0.148854 |
| am | -0.137225 | -0.077217 | 0.672481 | -0.071934 | -0.077217 | -0.077217 | -0.077217 | -0.077217 | -0.077217 |
| and | -0.147695 | -0.083461 | -0.155626 | -0.07778 | -0.083461 | -0.083461 | 0.79841 | -0.083461 | -0.083461 |
| eggs | -0.147695 | -0.083461 | -0.155626 | -0.07778 | -0.083461 | -0.083461 | -0.083461 | 0.79841 | -0.083461 |
| green | -0.147695 | -0.083461 | -0.155626 | -0.07778 | -0.083461 | -0.083461 | -0.083461 | -0.083461 | 0.79841 |
| ham | -0.147695 | -0.083461 | -0.155626 | -0.07778 | -0.083461 | 0.79841 | -0.083461 | -0.083461 | -0.083461 |
| like | -0.147695 | -0.083461 | -0.155626 | -0.07778 | 0.79841 | -0.083461 | -0.083461 | -0.083461 | -0.083461 |
| not | -0.147695 | 0.79841 | -0.155626 | -0.07778 | -0.083461 | -0.083461 | -0.083461 | -0.083461 | -0.083461 |
| was | -0.137225 | -0.077217 | 0.672481 | -0.071934 | -0.077217 | -0.077217 | -0.077217 | -0.077217 | -0.077217 |


<u>Problems with this approach?</u>

<br><br><br>

- Scalability: for an ngram model, we must have a different parameter for every distinct ngram in the data
- Sparsity: for a given word, only a small number of ngram features will be present. E.g.:
  - "I am so happy" 
  - "You are super glad"
  - In this example, there would be no feature overlap, reducing the similarity between "happy" and "glad"
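
The sparsity problem can be checked directly. This sketch extracts the preceding-bigram feature for each target word, as in the classifier above; the helper `preceding_ngram` is written just for this illustration.

```python
def preceding_ngram(sentence, target, n):
    """Return the n words before `target`, joined into one feature name."""
    tokens = sentence.split()
    i = tokens.index(target)
    return '_'.join(tokens[i - n:i])

happy_features = {preceding_ngram('I am so happy', 'happy', 2)}
glad_features = {preceding_ngram('You are super glad', 'glad', 2)}

print(happy_features)                   # {'am_so'}
print(glad_features)                    # {'are_super'}
print(happy_features & glad_features)   # set()
```

The two contexts mean nearly the same thing, yet they share no features, so the ngram representation gives no evidence that "happy" and "glad" are related.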
  
  
<u> Solution </u>
- Use a small, dense vector to represent each word.
- When computing features, use this word vector to represent preceding words, rather than the words themselves.


$p(w_i \mid w_{i-1} \ldots w_{i-n})$ becomes  
$p(w_i \mid v_{i-1} \ldots v_{i-n})$  
where $v_i$ is the vector representation of word $i$.

<u> Objective </u>

Jointly learn parameters and word representation to enable prediction of $p(w_i \mid v_{i-1} \ldots v_{i-n})$.

In [18]:
# E.g., fix dimension of each word vector to 3
# Initialize to small random values in [-.1, .1]
vec_dim = 3
np.random.seed(1234)
word_vectors = np.random.uniform(-.1, .1, (len(clf.classes_), vec_dim))
pd.DataFrame(word_vectors, index=clf.classes_)
# Want to update these parameters so that similar words have similar values.

| | 0 | 1 | 2 |
|---|---|---|---|
| Dan | -0.061696 | 0.024422 | -0.012454 |
| Sam | 0.057072 | 0.055995 | -0.045481 |
| am | -0.044707 | 0.060374 | 0.091628 |
| and | 0.075187 | -0.028437 | 0.000199 |
| eggs | 0.036693 | 0.04254 | -0.02595 |
| green | 0.012239 | 0.000617 | -0.097246 |
| ham | 0.054565 | 0.076528 | -0.027023 |
| like | 0.023079 | -0.084924 | -0.026235 |
| not | 0.086628 | 0.030276 | -0.020559 |
| was | 0.057746 | -0.036633 | 0.01362 |


## Feature representation
To represent features in the classifier: 
$p(w_i \mid v_{i-1} \ldots v_{i-n})$  

we will concatenate the vectors for each prior word.

E.g., the features for
$p(w_i \mid \mathrm{green\: eggs})$ are  
$[0.012, 0.001, -0.097, 0.037, 0.043, -0.026]$
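
The concatenation step can be sketched as follows. The two vectors are copied from the randomly initialized table above; the `featurize` helper is a name chosen for this illustration.

```python
# Word vectors for "green" and "eggs", copied from the table of initial values.
word_vectors = {
    'green': [0.012239, 0.000617, -0.097246],
    'eggs': [0.036693, 0.042540, -0.025950],
}

def featurize(context, word_vectors):
    """Concatenate the vectors of the context words, in order."""
    features = []
    for word in context:
        features.extend(word_vectors[word])
    return features

features = featurize(['green', 'eggs'], word_vectors)
print([round(v, 3) for v in features])
# [0.012, 0.001, -0.097, 0.037, 0.043, -0.026]
```

With $n$ context words and vectors of dimension $d$, the classifier sees $n \times d$ dense features, regardless of vocabulary size.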

But, how will we optimize these vectors?

Just doing logistic regression will associate weights with each vector element, but not change the word representations.



<u> Neural nets to the rescue! </u>

![nn](figs/nn.png)

- Word vectors become hidden nodes in a neural network.
- An additional hidden layer allows non-linear transformations of word vectors
- Training the model to optimize $p(w_i \mid v_{i-1} \ldots v_{i-n})$ results in "useful" vectors for $v$.
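
As a rough sketch of this architecture, here is the forward pass only, in plain Python. The vocabulary, layer sizes, `tanh` activation, and the omission of bias terms are all assumptions made for illustration; training (backpropagating into `E` as well as the weights) is not shown.

```python
import math
import random

random.seed(0)

# Hypothetical vocabulary and layer sizes.
vocab = ['I', 'am', 'Sam', 'green', 'eggs', 'and', 'ham']
word_to_id = {w: i for i, w in enumerate(vocab)}
vec_dim, n_context, n_hidden = 3, 2, 4

def rand_matrix(rows, cols, scale=0.1):
    """Matrix of small random values in [-scale, scale]."""
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

E = rand_matrix(len(vocab), vec_dim)               # word vectors, one row per word
W_h = rand_matrix(n_context * vec_dim, n_hidden)   # concatenated vectors -> hidden layer
W_o = rand_matrix(n_hidden, len(vocab))            # hidden layer -> one score per word

def matvec(x, W):
    """Row vector x times matrix W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(scores):
    """Normalize scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(context):
    """p(w_i | context): look up vectors, concatenate, hidden layer, softmax."""
    x = [v for w in context for v in E[word_to_id[w]]]
    h = [math.tanh(a) for a in matvec(x, W_h)]
    return softmax(matvec(h, W_o))

probs = predict_proba(['green', 'eggs'])
print(len(probs), round(sum(probs), 6))   # 7 1.0
```

Because `E` is just another weight matrix in the network, gradient-based training updates the word vectors together with `W_h` and `W_o`, which plain logistic regression on fixed features cannot do.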

## Other architectures

There have been many architectures proposed to learn useful word vectors:

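Continuous bag-of-words (CBOW): $p(w_i \mid w_{i-n} \ldots w_{i-1}, w_{i+1} \ldots w_{i+n})$ (predict the current word given the context before and after it)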

![cbow](figs/cbow.png)

Skip-gram: $p(w_{i-n} \ldots w_{i-1}, w_{i+1} \ldots w_{i+n} \mid w_i)$ (predict the context given the current word)

![skip](figs/skip.png)
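
The difference between these two architectures (commonly called CBOW and skip-gram) shows up in how training examples are built. The sketch below only generates the examples for a window of one word on each side; the function names are chosen for this illustration and no model is trained.

```python
def cbow_examples(tokens, window):
    """Yield (context words, target): predict the target from its surroundings."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, target

def skipgram_examples(tokens, window):
    """Yield (target, context word): predict each surrounding word from the target."""
    for context, target in cbow_examples(tokens, window):
        for c in context:
            yield target, c

tokens = 'I do not like green eggs and ham'.split()

cbow = list(cbow_examples(tokens, window=1))
skip = list(skipgram_examples(tokens, window=1))

print(cbow[:3])
# [(['do'], 'I'), (['I', 'not'], 'do'), (['do', 'like'], 'not')]
print(skip[:3])
# [('I', 'do'), ('do', 'I'), ('do', 'not')]
```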

## Visualizing word vectors

![vis1.jpg](figs/vis1.jpg)

![vis2.jpg](figs/vis2.png)

![vis3.jpg](figs/vis3.png)



**Image sources**

- https://engineering.aweber.com/cosine-similarity/
- http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf
- https://deeplearning4j.org/img/countries_capitals.png
- https://adriancolyer.files.wordpress.com/2016/04/word2vec-gender-relation.png?w=600
