# CS224N - Intro & Word Vectors (Lecture 1)

**Common NLP solution:** Use, e.g., WordNet, a thesaurus containing lists of **synonym sets** and **hypernyms** ("is a" relationships):

In [2]:
import nltk
nltk.download('wordnet')

# e.g., synonym sets containing "good":
from nltk.corpus import wordnet as wn
poses = { 'n':'noun', 'v':'verb', 's':'adj (s)', 'a':'adj', 'r':'adv' }
for synset in wn.synsets('good'):
    print('{}: {}'.format(poses[synset.pos()],
            ", ".join([l.name() for l in synset.lemmas()])))

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/stefanjenss/nltk_data...


noun: good
noun: good, goodness
noun: good, goodness
noun: commodity, trade_good, good
adj: good
adj (s): full, good
adj: good
adj (s): estimable, good, honorable, respectable
adj (s): beneficial, good
adj (s): good
adj (s): good, just, upright
adj (s): adept, expert, good, practiced, proficient, skillful, skilful
adj (s): good
adj (s): dear, good, near
adj (s): dependable, good, safe, secure
adj (s): good, right, ripe
adj (s): good, well
adj (s): effective, good, in_effect, in_force
adj (s): good
adj (s): good, serious
adj (s): good, sound
adj (s): good, salutary
adj (s): good, honest
adj (s): good, undecomposed, unspoiled, unspoilt
adj (s): good
adv: well, good
adv: thoroughly, soundly, good


In [3]:
# e.g., hypernyms of "panda":
from nltk.corpus import wordnet as wn
panda = wn.synset('panda.n.01')
hyper = lambda s: s.hypernyms()
list(panda.closure(hyper))

[Synset('procyonid.n.01'),
 Synset('carnivore.n.01'),
 Synset('placental.n.01'),
 Synset('mammal.n.01'),
 Synset('vertebrate.n.01'),
 Synset('chordate.n.01'),
 Synset('animal.n.01'),
 Synset('organism.n.01'),
 Synset('living_thing.n.01'),
 Synset('whole.n.02'),
 Synset('object.n.01'),
 Synset('physical_entity.n.01'),
 Synset('entity.n.01')]

### Problems with resources like WordNet
- Great as a resource but missing nuance
    - e.g., "proficient" is listed as a synonym for "good"
        - This is only correct in some contexts
- Missing new meanings of words
    - e.g., wicked, badass, nifty, wizard, genius, ninja, bombest
    - Impossible to keep up-to-date
- Subjective
- Requires human labor to create and adapt
- Can't compute accurate word similarity

### Representing words as discrete symbols
- In traditional NLP, we regard words as discrete symbols (e.g., hotel, conference, motel): a ***localist*** representation
- Such symbols for words can be represented by one-hot vectors
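
A minimal sketch of one-hot vectors, assuming a toy three-word vocabulary (the vocabulary and the `one_hot` helper are made up for illustration). It also shows the usual drawback of this representation: any two distinct one-hot vectors have a dot product of zero, so "hotel" and "motel" look no more alike than any other pair.

```python
import numpy as np

# Toy vocabulary (an assumption for illustration; real vocabularies are far larger).
vocab = ["hotel", "conference", "motel"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

hotel = one_hot("hotel")
motel = one_hot("motel")

print(hotel)                  # a single 1 at the index of "hotel"
print(np.dot(hotel, motel))   # distinct one-hot vectors are orthogonal
```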

### Representing words by their context
- ***Distributional semantics:*** **A word's meaning is given by the words that frequently appear close-by**
    - When a word *w* appears in a text, its **context** is the set of words that appear nearby (within a fixed-size window).
    - In the lecture's example, the ***context words*** around each occurrence of **banking** are what represent **banking**
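
The fixed-size window idea can be sketched as follows. The helper name `context_windows`, the window size of 2, and the short example sentence are assumptions chosen for illustration, not part of the lecture code.

```python
def context_windows(tokens, window_size=2):
    """Yield (center, context_words) for every position in tokens."""
    for t, center in enumerate(tokens):
        left = tokens[max(0, t - window_size):t]      # up to window_size words before t
        right = tokens[t + 1:t + 1 + window_size]     # up to window_size words after t
        yield center, left + right

# Illustrative sentence; every token is unique, so a dict keyed by center word is safe here.
sentence = "government debt problems turning into banking crises".split()
contexts = dict(context_windows(sentence, window_size=2))

print(contexts["banking"])
```

Near the edges of the text the window is simply truncated, which is why "banking" only gets one word on its right in this example.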

### Word vectors
We will build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts.

- Note: ***word vectors*** are also called ***word embeddings*** or ***(neural) word representations***
    - They are a ***distributed*** representation

### Word2Vec: Overview
***Word2Vec*** (Mikolov et al. 2013) is a framework for learning word vectors.

Idea:
- We have a large corpus ("body") of text
- Every word in a fixed vocabulary is represented by a ***vector***
- Go through each position *t* in the text, which has a center word *c* and context ("outside") words *o*
- Use the ***similarity of the word vectors*** for *c* and *o* to ***calculate the probability*** of *o* given *c* (or vice versa)
- ***Keep adjusting the word vectors*** to maximize this probability
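
A hedged sketch of the probability step. The notes above do not spell out the formula, so this assumes the standard skip-gram formulation: each word has a center vector (rows of `V`) and an outside vector (rows of `U`), and the probability of *o* given *c* is a softmax over dot products. The tiny vocabulary, the random vectors, and the function name are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["into", "banking", "crises", "problems", "turning"]
dim = 4
V = rng.normal(size=(len(vocab), dim))  # center-word vectors (randomly initialised)
U = rng.normal(size=(len(vocab), dim))  # outside-word vectors (randomly initialised)

def prob_outside_given_center(o, c):
    """P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ V[c]               # dot product of every outside vector with v_c
    scores -= scores.max()          # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

c = vocab.index("banking")
probs = np.array([prob_outside_given_center(o, c) for o in range(len(vocab))])
print(dict(zip(vocab, probs.round(3))))
```

Training would then nudge `U` and `V` so that the outside words actually observed near each center word receive higher probability; that optimisation loop is not shown here.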

## Gensim word vector visualization

In [2]:
import numpy as np

# Install the gensim package into the current Jupyter kernel
!pip install gensim

# Get the interactive Tools for Matplotlib
%matplotlib notebook
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn.decomposition import PCA

import gensim.downloader as api
from gensim.models import KeyedVectors

Collecting gensim
  Downloading gensim-4.3.2-cp310-cp310-macosx_11_0_arm64.whl.metadata (8.4 kB)
Collecting smart-open>=1.8.1 (from gensim)
  Using cached smart_open-7.0.4-py3-none-any.whl.metadata (23 kB)
Collecting wrapt (from smart-open>=1.8.1->gensim)
  Downloading wrapt-1.16.0-cp310-cp310-macosx_11_0_arm64.whl.metadata (6.6 kB)
Downloading gensim-4.3.2-cp310-cp310-macosx_11_0_arm64.whl (24.0 MB)
Using cached smart_open-7.0.4-py3-none-any.whl (61 kB)
Downloading wrapt-1.16.0-cp310-cp310-macosx_11_0_arm64.whl (38 kB)
Installing collected packages: wrapt, smart-open, gensim
Successfully installed gensim-4.3.2 smart-open-7.0.4 wrapt-1.16.0


For looking at word vectors, we use Gensim. Gensim isn't really a deep learning package. It's a package for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. But it's efficient and scalable, and quite widely used.

Stanford has a homegrown offering of GloVe word vectors. Gensim provides a library of several sets of word vectors that you can easily load. You can find out more about GloVe on the GloVe page. The professor uses the 100d vectors below as a balance between speed and smallness vs. quality. If you try out the 50d vectors, they basically work for similarity but clearly aren't as good for analogy problems. If you load the 300d vectors, you'll wait longer, but they're even better than the 100d vectors.

In [3]:
model = api.load("glove-wiki-gigaword-100")
print(type(model))

<class 'gensim.models.keyedvectors.KeyedVectors'>


In [4]:
model['bread']

array([-0.66146  ,  0.94335  , -0.72214  ,  0.17403  , -0.42524  ,
        0.36303  ,  1.0135   , -0.14802  ,  0.25817  , -0.20326  ,
       -0.64338  ,  0.16632  ,  0.61518  ,  1.397    , -0.094506 ,
        0.0041843, -0.18976  , -0.55421  , -0.39371  , -0.22501  ,
       -0.34643  ,  0.32076  ,  0.34395  , -0.7034   ,  0.23932  ,
        0.69951  , -0.16461  , -0.31819  , -0.34034  , -0.44906  ,
       -0.069667 ,  0.35348  ,  0.17498  , -0.95057  , -0.2209   ,
        1.0647   ,  0.23231  ,  0.32569  ,  0.47662  , -1.1206   ,
        0.28168  , -0.75172  , -0.54654  , -0.66337  ,  0.34804  ,
       -0.69058  , -0.77092  , -0.40167  , -0.069351 , -0.049238 ,
       -0.39351  ,  0.16735  , -0.14512  ,  1.0083   , -1.0608   ,
       -0.87314  , -0.29339  ,  0.68278  ,  0.61634  , -0.088844 ,
        0.88094  ,  0.099809 , -0.27161  , -0.58026  ,  0.50364  ,
       -0.93814  ,  0.67576  , -0.43124  , -0.10517  , -1.2404   ,
       -0.74353  ,  0.28637  ,  0.29012  ,  0.89377  ,  0.6740

In [5]:
model['croissant']

array([-0.25144  ,  0.52157  , -0.75452  ,  0.28039  , -0.31388  ,
        0.274    ,  1.1971   , -0.10519  ,  0.82544  , -0.33398  ,
       -0.21417  ,  0.22216  ,  0.14982  ,  0.47384  ,  0.41984  ,
        0.69397  , -0.25999  , -0.44414  ,  0.58296  , -0.30851  ,
       -0.076455 ,  0.33468  ,  0.28055  , -0.99012  ,  0.30349  ,
        0.39128  ,  0.031526 , -0.095395 , -0.004745 , -0.81347  ,
        0.27869  , -0.1812   ,  0.14632  , -0.42186  ,  0.13857  ,
        1.139    ,  0.14925  , -0.051459 ,  0.37875  , -0.2613   ,
        0.011081 , -0.28881  , -0.38662  , -0.3135   , -0.1954   ,
        0.19248  , -0.52995  , -0.40674  , -0.25159  ,  0.06272  ,
       -0.32724  ,  0.28374  , -0.2155   , -0.061832 , -0.50134  ,
        0.0093959,  0.30715  ,  0.3873   , -0.74554  , -0.45947  ,
        0.40032  , -0.1378   , -0.26968  , -0.3946   , -0.64876  ,
       -0.47149  , -0.085536 ,  0.092795 , -0.034018 , -0.61906  ,
        0.19123  ,  0.20563  ,  0.29056  , -0.010908 ,  0.1531

In [6]:
model.most_similar('usa')

[('canada', 0.6544383764266968),
 ('america', 0.645224392414093),
 ('u.s.a.', 0.6184033751487732),
 ('united', 0.6017189621925354),
 ('states', 0.5970699191093445),
 ('australia', 0.5838716626167297),
 ('world', 0.5590084195137024),
 ('2010', 0.558070182800293),
 ('2012', 0.5504006743431091),
 ('davis', 0.5464468002319336)]
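
The scores returned by `most_similar` are cosine similarities. A minimal sketch of that measure on made-up 3-d vectors (the vectors and the helper name are assumptions for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 = same direction, 0 = perpendicular."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])    # same direction as a, different length
c = np.array([-2.0, 1.0, 0.0])   # perpendicular to a

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Because the measure ignores vector length, only direction matters. With the loaded model, `model.similarity('usa', 'canada')` should give the pairwise score for the top neighbour listed above.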

In [7]:
model.most_similar('banana')

[('coconut', 0.7097253799438477),
 ('mango', 0.705482542514801),
 ('bananas', 0.6887733936309814),
 ('potato', 0.6629635691642761),
 ('pineapple', 0.6534532308578491),
 ('fruit', 0.6519854664802551),
 ('peanut', 0.6420575976371765),
 ('pecan', 0.6349172592163086),
 ('cashew', 0.6294420957565308),
 ('papaya', 0.6246591210365295)]

In [8]:
model.most_similar('croissant')

[('croissants', 0.682984471321106),
 ('brioche', 0.6283302903175354),
 ('baguette', 0.5968102812767029),
 ('focaccia', 0.5876684188842773),
 ('pudding', 0.5803956389427185),
 ('souffle', 0.5614769458770752),
 ('baguettes', 0.5558240413665771),
 ('tortilla', 0.5449503064155579),
 ('pastries', 0.5427730679512024),
 ('calzone', 0.5374531745910645)]

In [9]:
model.most_similar(negative='banana')

[('shunichi', 0.49618101119995117),
 ('ieronymos', 0.4736502170562744),
 ('pengrowth', 0.4668096601963043),
 ('höss', 0.4636845886707306),
 ('damaskinos', 0.4617849290370941),
 ('yadin', 0.4617374837398529),
 ('hundertwasser', 0.4588957726955414),
 ('ncpa', 0.4577339291572571),
 ('maccormac', 0.4566109776496887),
 ('rothfeld', 0.4523947536945343)]

In [10]:
result = model.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

queen: 0.7699
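
A rough sketch of what the `positive`/`negative` query is doing, on hypothetical 2-d toy vectors rather than the real GloVe vectors. The assumption here is that the query averages unit-length vectors (negatives with weight -1), then returns the nearest remaining word by cosine similarity, skipping the input words. The toy values and `toy_analogy` are made up for illustration.

```python
import numpy as np

# Hypothetical 2-d toy vectors: axis 0 loosely "royalty", axis 1 loosely "gender".
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def unit(v):
    """Scale v to length 1."""
    return v / np.linalg.norm(v)

def toy_analogy(positive, negative):
    # Average the unit vectors, counting negative words with weight -1.
    target = np.mean([unit(toy[w]) for w in positive] +
                     [-unit(toy[w]) for w in negative], axis=0)
    # Exclude the query words themselves, then pick the most similar candidate.
    candidates = [w for w in toy if w not in positive + negative]
    return max(candidates, key=lambda w: float(np.dot(unit(toy[w]), unit(target))))

print(toy_analogy(positive=["woman", "king"], negative=["man"]))
```

Excluding the query words matters: without that step, the nearest vector to the combined target is often one of the inputs.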


In [11]:
# x1 : x2 :: y1 : returned
def analogy(x1, x2, y1):
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]

In [12]:
analogy('man', 'king', 'queen')

'monarch'

In [21]:
analogy('king', 'man', 'queen')

'woman'

In [20]:
print(model.doesnt_match("breakfast cereal dinner lunch".split()))

cereal
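
One way to think about `doesnt_match`, sketched on hypothetical 2-d toy vectors: normalise every word vector, take their mean, and report the word least similar to that mean. This is an assumption about the general approach, not a copy of Gensim's code, and the toy values are made up.

```python
import numpy as np

# Hypothetical toy vectors: the three meals point roughly the same way, "cereal" does not.
toy = {
    "breakfast": np.array([0.9,  0.1]),
    "lunch":     np.array([1.0,  0.0]),
    "dinner":    np.array([0.9, -0.1]),
    "cereal":    np.array([0.3,  0.9]),
}

def odd_one_out(words):
    # Unit-normalise each vector so only direction matters.
    units = np.array([toy[w] / np.linalg.norm(toy[w]) for w in words])
    mean = units.mean(axis=0)
    mean /= np.linalg.norm(mean)
    sims = units @ mean                     # cosine similarity of each word to the mean
    return words[int(np.argmin(sims))]      # least similar word is the outlier

print(odd_one_out(["breakfast", "cereal", "dinner", "lunch"]))
```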
