<a href="https://colab.research.google.com/github/Rosefinch-Midsummer/Awesome-Colab/blob/master/NLP/Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GloVe Word Embeddings Demo

This demo was part of a presentation for [this word embeddings workshop](https://www.eventbrite.com/e/practical-ai-for-female-engineers-product-managers-and-designers-tickets-34805104003) and a talk at [the Demystifying AI conference](https://www.eventbrite.com/e/demystifying-deep-learning-ai-tickets-34351888423).  It is not necessary to download the demo to be able to follow along and enjoy the workshop.

It is available on GitHub at https://github.com/fastai/word-embeddings-workshop

## Loading our data

### Imports

In [0]:
#Standard Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import pickle
import re
import json

In [0]:
np.set_printoptions(precision=4, suppress=True)

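The dataset is available at http://files.fast.ai/models/glove/6B.100d.tgz
The commands below fetch a different archive, glove_50_glove_100.tgz, which contains both the 50- and 100-dimensional vectors used in this demo. To download and unzip it from the command line, you can run: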

    wget http://files.fast.ai/models/glove_50_glove_100.tgz 
    tar xvzf glove_50_glove_100.tgz

In [0]:
!wget http://files.fast.ai/models/glove_50_glove_100.tgz 
!tar xvzf glove_50_glove_100.tgz

--2019-12-23 06:11:52--  http://files.fast.ai/models/glove_50_glove_100.tgz
Resolving files.fast.ai (files.fast.ai)... 67.205.15.147
Connecting to files.fast.ai (files.fast.ai)|67.205.15.147|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 225083583 (215M) [application/x-gtar-compressed]
Saving to: ‘glove_50_glove_100.tgz’


2019-12-23 06:11:59 (31.8 MB/s) - ‘glove_50_glove_100.tgz’ saved [225083583/225083583]

glove_vectors_100d.npy
glove_vectors_50d.npy
words.txt
wordsidx.txt


Update the paths below so that they point to wherever you stored the data.

In [0]:
vecs = np.load("glove_vectors_100d.npy")
vecs50 = np.load("glove_vectors_50d.npy")

In [0]:
with open('words.txt') as f:
    content = f.readlines()
words = [x.strip() for x in content] 

In [0]:
# wordsidx.txt holds a JSON object mapping each word to its row index
with open('wordsidx.txt') as f:
    wordidx = json.load(f)

### What the data looks like

Let's see what our data looks like:

In [0]:
len(words)

400000

In [0]:
words[:10]

['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s"]

In [0]:
words[600:610]

['together',
 'congress',
 'index',
 'australia',
 'results',
 'hard',
 'hours',
 'land',
 'action',
 'higher']

wordidx allows us to look up a word in order to find out its index:

In [0]:
type(wordidx)

dict

In [0]:
wordidx['feminist']

11853

In [0]:
words[11853]

'feminist'
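
The list and the dictionary are inverses of each other: words maps an index to a word, and wordidx maps a word back to its index. As a minimal sketch of that relationship, a dictionary like wordidx could be rebuilt from a list like words. The names toy_words and toy_wordidx are hypothetical stand-ins, not part of the downloaded data:

```python
# Toy vocabulary standing in for the 400,000-word list loaded from words.txt.
toy_words = ["the", ",", ".", "of", "to"]

# Map each word to its position in the list.
toy_wordidx = {word: i for i, word in enumerate(toy_words)}

# Round trip: word -> index -> word.
idx = toy_wordidx["of"]
print(idx, toy_words[idx])
```

The same round trip is what the wordidx['feminist'] and words[11853] cells above demonstrate on the real data.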

## Words as vectors

The word "feminist" (index 11853) is represented by the 100-dimensional vector:

In [0]:
type(vecs)

numpy.ndarray

In [0]:
vecs[11853]

array([ 0.296 ,  0.7626, -0.9866,  0.3776,  0.3194,  0.8286, -0.1686,
       -1.4558,  0.1965,  0.3854, -0.3348, -0.6503, -0.2528, -0.11  ,
       -0.1545,  0.5354, -0.4527, -0.0516,  0.1312,  0.0744,  0.5001,
        0.2151,  0.0688,  0.4347,  0.261 , -0.0371,  0.1385, -1.518 ,
        0.0641,  0.149 , -0.0314,  0.5038,  0.2839,  0.3457, -0.4411,
       -0.3459, -0.2118,  0.5651, -0.088 , -0.0438, -1.2228,  0.6039,
       -0.23  ,  0.2287, -0.2695, -0.9398,  0.2376,  0.3302, -0.2422,
        0.6359,  0.1347,  0.5542,  0.1432,  0.2861,  0.0216, -0.7437,
        0.3508,  0.362 ,  0.5566,  0.3403,  0.3613,  0.5185, -0.5437,
       -0.285 ,  1.1831, -0.1192,  0.2473,  0.0614,  0.4436, -0.244 ,
        0.2016,  0.5143, -0.4695, -0.0974, -0.9836, -0.3594,  0.3903,
       -0.517 , -0.1659, -1.2132, -1.3228,  0.0578,  0.7022,  0.3492,
       -0.9103, -0.381 , -0.1545,  0.4467, -0.009 , -0.9838,  1.0114,
       -0.227 ,  0.2697,  0.1566,  0.5613,  0.1175, -0.5755, -0.6324,
        0.1052,  1.2

This lets us do some useful calculations. For instance, we can see how far apart two words are using a distance metric:

In [0]:
from scipy.spatial.distance import cosine as dist

Smaller numbers mean two words are closer together; larger numbers mean they are further apart.
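
To make the metric concrete: cosine distance is 1 minus the cosine of the angle between two vectors, so it ignores vector length and only compares direction. Here is a small NumPy sketch of the same calculation on made-up vectors. The helper name cosine_distance is just for illustration:

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Same direction (different length), perpendicular, and opposite vectors.
same_direction = cosine_distance([1, 2, 3], [2, 4, 6])
orthogonal = cosine_distance([1, 0], [0, 1])
opposite = cosine_distance([1, 0], [-1, 0])
print(same_direction, orthogonal, opposite)  # approximately 0, 1 and 2
```

So the values range from about 0 (same direction) to 2 (opposite directions), with 1 meaning the vectors are perpendicular.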

The distance between similar words is low:

In [0]:
dist(vecs[wordidx["puppy"]], vecs[wordidx["dog"]])

0.27636247873306274

In [0]:
dist(vecs[wordidx["queen"]], vecs[wordidx["princess"]])

0.20527541637420654

And the distance between unrelated words is high:

In [0]:
dist(vecs[wordidx["celebrity"]], vecs[wordidx["dusty"]])

0.9883578838780522

In [0]:
dist(vecs[wordidx["kitten"]], vecs[wordidx["airplane"]])

0.8729851841926575

In [0]:
dist(vecs[wordidx["avalanche"]], vecs[wordidx["antique"]])

0.9621107056736946

### Bias

Word embeddings can pick up bias. For example, "genius" is closer to "man" than to "woman":

In [0]:
dist(vecs[wordidx["man"]], vecs[wordidx["genius"]])

0.5098515152931213

In [0]:
dist(vecs[wordidx["woman"]], vecs[wordidx["genius"]])

0.689783364534378

Not all pairs follow the stereotype. Here, "emotional" is also slightly closer to "man" than to "woman":

In [0]:
dist(vecs[wordidx["man"]], vecs[wordidx["emotional"]])

0.5595748424530029

In [0]:
dist(vecs[wordidx["woman"]], vecs[wordidx["emotional"]])

0.6257205307483673

I only checked the distance between individual pairs of words because it is a quick and simple way to illustrate the concept.  It is also a very **noisy** approach, and **researchers approach this problem in more systematic ways**.

## Visualizing the words

We will use [Plotly](https://plot.ly/), a Python library to make interactive graphs (note: everything below is done with the free, offline version of Plotly).

### Methods

In [0]:
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot
from IPython.display import IFrame

In [0]:
def enable_plotly_in_cell():
    # Load require.js in this cell's output and switch Plotly to offline notebook mode
    from IPython.display import display, HTML
    from plotly.offline import init_notebook_mode
    display(HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
    init_notebook_mode(connected=False)

In [0]:
def plotly_3d(Y, cat_labels):
    # Expects 5 rows of Y per category, in order; hover text is read from the global my_words list
    trace_dict = {}
    for i, label in enumerate(cat_labels):
        trace_dict[i] = go.Scatter3d(
            x=Y[i*5:(i+1)*5, 0],
            y=Y[i*5:(i+1)*5, 1],
            z=Y[i*5:(i+1)*5, 2],
            mode='markers',
            marker=dict(
                size=8,
                line=dict(
                    color='rgba('+ str(i*40) + ',' + str(i*40) + ',' + str(i*40) + ', 0.14)',
                    width=0.5
                ),
                opacity=0.8
            ),
            text = my_words[i*5:(i+1)*5],
            name = label
        )

    data = [item for item in trace_dict.values()]
    layout = go.Layout(
        margin=dict(
            l=0,
            r=0,
            b=0,
            t=0
        )
    )

    plotly.offline.plot({
        "data": data,
        "layout": layout
    })

In [0]:
def plotly_2d(Y, cat_labels):
    # Expects 5 rows of Y per category, in order; hover text is read from the global my_words list
    trace_dict = {}
    for i, label in enumerate(cat_labels):
        trace_dict[i] = go.Scatter(
            x=Y[i*5:(i+1)*5, 0],
            y=Y[i*5:(i+1)*5, 1],
            mode='markers',
            marker=dict(
                size=8,
                line=dict(
                    color='rgba('+ str(i*40) + ',' + str(i*40) + ',' + str(i*40) + ', 0.14)',
                    width=0.5
                ),
                opacity=0.8
            ),
            text = my_words[i*5:(i+1)*5],
            name = label
        )

    data = [item for item in trace_dict.values()]
    layout = go.Layout(
        margin=dict(
            l=0,
            r=0,
            b=0,
            t=0
        )
    )

    plotly.offline.plot({
        "data": data,
        "layout": layout
    })

### Preparing the Data

Let's plot words from a few different categories:

In [0]:
categories = [
              "bugs", "music", 
              "pleasant", "unpleasant", 
              "science", "arts"
             ]

In [0]:
my_words = [
            "maggot", "flea", "tarantula", "bedbug", "mosquito", 
            "violin", "cello", "flute", "harp", "mandolin",
            "joy", "love", "peace", "pleasure", "wonderful",
            "agony", "terrible", "horrible", "nasty", "failure", 
            "physics", "chemistry", "science", "technology", "engineering",
            "poetry", "art", "literature", "dance", "symphony",
           ]

Again, we need to look up the indices of our words using the wordidx dictionary:

In [0]:
X = np.array([wordidx[word] for word in my_words])

In [0]:
vecs[X].shape

(30, 100)

Now we will combine our words with the first 10,000 words in the vocabulary and create a matrix of their embeddings. This is a plain concatenation, so any of our words that are already among the first 10,000 will appear twice.

In [0]:
embeddings = np.concatenate((vecs[X], vecs[:10000,:]), axis=0)
embeddings.shape

(10030, 100)

### Viewing the words in 3D

The words are in 100 dimensions, so we need a way to reduce them to 3 dimensions in order to view them.  Two good options are t-SNE and PCA.  The main idea is to find a meaningful way to go from 100 dimensions to 3 dimensions (while keeping a similar notion of what is close to what).

You would typically just use one of these (t-SNE or PCA).  I've included both if you're interested.

#### t-SNE

In [0]:
from sklearn import manifold

In [0]:
tsne = manifold.TSNE(n_components=3, init='pca', random_state=0)
#Y = tsne.fit_transform(subset)
Y = tsne.fit_transform(embeddings)
plotly_3d(Y, categories)

In [0]:
IFrame('temp-plot.html', width=600, height=400)

#### PCA

In [0]:
from sklearn import decomposition

In [0]:
pca = decomposition.PCA(n_components=3).fit(embeddings.T)
#pca = decomposition.PCA(n_components=3).fit(subset.T)
components = pca.components_
plotly_3d(components.T[:len(my_words),:], categories)

In [0]:
IFrame('temp-plot.html', width=600, height=400)
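
Another common way to apply PCA is to treat each word as a row, center the data, and project every row onto the top principal directions. The sketch below does this with plain NumPy on random stand-in data. The helper pca_project and the array fake_embeddings are hypothetical and not part of the original demo:

```python
import numpy as np

def pca_project(X, n_components=3):
    """Project the rows of X onto their top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(30, 100))  # stand-in for vecs[X]
Y_pca = pca_project(fake_embeddings)
print(Y_pca.shape)  # (30, 3)
```

With the real data, the resulting 30 x 3 array could presumably be passed to plotly_3d in place of Y, since it has one row per word in my_words.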

## Nearest Neighbors

We can also see what words are close to a given word.

In [0]:
from sklearn.neighbors import NearestNeighbors

Nearest Neighbors is an algorithm that finds the points closest to a given point.

In [0]:
neigh = NearestNeighbors(n_neighbors=10, radius=0.5, metric='cosine', algorithm='brute')
neigh.fit(vecs) 

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                 radius=0.5)

In [0]:
distances, indices = neigh.kneighbors([vecs[wordidx["feminist"]]])

In [0]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('feminist', 1.1920929e-07),
 ('feminism', 0.24721634),
 ('feminists', 0.28110862),
 ('activism', 0.34637702),
 ('activist', 0.36203873),
 ('lesbian', 0.3641898),
 ('humanist', 0.37725556),
 ('modernist', 0.3795184),
 ('left-wing', 0.37973535),
 ('postmodern', 0.38574135)]
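
The lookup-and-format steps above are repeated for every query in the rest of this section, so it can help to see the idea as one function. This is a brute-force sketch in plain NumPy on a made-up four-word vocabulary. The helper nearest_words and the toy vectors are illustrative assumptions, not the sklearn implementation:

```python
import numpy as np

def nearest_words(query, vectors, vocab, k=3):
    """Return the k (word, cosine distance) pairs closest to `query`."""
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    # Cosine distance from the query to every row at once.
    distances = 1.0 - unit_vectors @ unit_query
    order = np.argsort(distances)[:k]
    return [(vocab[i], float(distances[i])) for i in order]

# Hand-made 3-d vectors, purely illustrative.
toy_vocab = ["cat", "kitten", "dog", "airplane"]
toy_vecs = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.8, 0.3, 0.2],
    [0.0, 0.1, 1.0],
])
neighbors = nearest_words(toy_vecs[0], toy_vecs, toy_vocab)
print(neighbors)
```

As in the real output, the query word itself comes back first with a distance of (almost exactly) zero.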

We can take this a step further, and add two words together.  What is the result?

In [0]:
new_vec = vecs[wordidx["artificial"]] + vecs[wordidx["intelligence"]]

In [0]:
new_vec

array([ 0.0345, -0.1185,  0.746 ,  0.3256,  0.3256, -1.4699, -0.8715,
       -0.9421,  0.0679,  0.922 ,  0.6811, -0.3729,  1.0969,  0.7196,
        1.3515,  1.2493,  0.6621,  0.1901, -0.2707, -0.0444, -1.232 ,
        0.1744,  0.7577, -0.9177, -1.2184,  0.6959, -0.1966, -0.415 ,
       -0.3358,  0.5452,  0.589 , -0.0299, -0.9744, -0.8937,  0.2283,
       -0.2092, -1.3795,  1.7811,  0.2269,  0.47  , -0.3045, -0.1573,
       -0.478 ,  0.3071,  0.4202, -0.4434,  0.1602,  0.1443, -0.9528,
       -0.5565,  0.7537,  0.182 ,  1.4008,  1.8967,  0.595 , -3.0072,
        0.6811, -0.2557,  2.0217,  0.7825,  0.4251,  1.3615,  0.5902,
       -0.1312,  0.9344, -0.5377, -0.3988, -0.6415,  0.6527,  0.5117,
        0.7315,  0.1396,  0.3785, -0.6403, -0.094 ,  0.1076,  0.6197,
        0.2537, -1.4346,  1.169 ,  1.6931,  0.1458, -0.5981,  0.8195,
       -3.1903,  1.2429,  2.1481,  1.6004,  0.2014, -0.2121,  0.3698,
       -0.001 , -0.628 ,  0.2869,  0.3119, -0.1093, -0.6341, -1.7804,
        0.5857,  0.3

In [0]:
distances, indices = neigh.kneighbors([new_vec])

In [0]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('intelligence', 0.18831605),
 ('artificial', 0.25617576),
 ('information', 0.3256532),
 ('knowledge', 0.33641893),
 ('secret', 0.36480355),
 ('human', 0.3672669),
 ('biological', 0.37090683),
 ('using', 0.3773631),
 ('scientific', 0.385139),
 ('communication', 0.38691515)]

In [0]:
distances, indices = neigh.kneighbors([vecs[wordidx["king"]]])

In [0]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('king', 0.0),
 ('prince', 0.23176706),
 ('queen', 0.24923098),
 ('son', 0.2979113),
 ('brother', 0.30142248),
 ('monarch', 0.30221093),
 ('throne', 0.30800098),
 ('kingdom', 0.31885898),
 ('father', 0.3197971),
 ('emperor', 0.32871425)]

In [0]:
new_vec = vecs[wordidx["king"]] - vecs[wordidx["he"]] + vecs[wordidx["she"]]

In [0]:
distances, indices = neigh.kneighbors([new_vec])

In [0]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('king', 0.13275808),
 ('queen', 0.16259885),
 ('princess', 0.24821734),
 ('daughter', 0.29121184),
 ('prince', 0.29464376),
 ('elizabeth', 0.29630506),
 ('mother', 0.3091293),
 ('sister', 0.31979597),
 ('father', 0.34473372),
 ('throne', 0.34474844)]
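
In the output above, "king" itself still ranks first, because the query words are not filtered out. A common convention for analogy queries is to exclude the input words from the results. Here is a sketch of that variation on made-up vectors. The helper analogy and the toy data are hypothetical:

```python
import numpy as np

def analogy(a, b, c, vectors, vocab_index, vocab, k=1):
    """Words closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = vectors[vocab_index[a]] - vectors[vocab_index[b]] + vectors[vocab_index[c]]
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    distances = 1.0 - unit_vectors @ (target / np.linalg.norm(target))
    excluded = {a, b, c}
    ranked = [(vocab[i], float(distances[i])) for i in np.argsort(distances)
              if vocab[i] not in excluded]
    return ranked[:k]

# Hand-made 3-d vectors (axes: royalty, male, female), purely illustrative.
toy_vocab = ["king", "queen", "he", "she", "prince"]
toy_index = {w: i for i, w in enumerate(toy_vocab)}
toy_vecs = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.8, 1.0, 0.0],
])
result = analogy("king", "he", "she", toy_vecs, toy_index, toy_vocab)
print(result)  # the top remaining word is 'queen'
```

The toy vectors are built so that the arithmetic works out exactly. Real embeddings are much messier, as the outputs in this section show.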

In [0]:
wordidx["programmer"]

19226

In [0]:
wordidx["student"]

1283

In [0]:
distances, indices = neigh.kneighbors([vecs[wordidx["programmer"]]])

Closest words to "programmer":

In [0]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('programmer', 1.1920929e-07),
 ('programmers', 0.32259798),
 ('animator', 0.36951023),
 ('software', 0.38250887),
 ('computer', 0.40600342),
 ('technician', 0.41406858),
 ('engineer', 0.4303757),
 ('user', 0.4356534),
 ('translator', 0.43721008),
 ('linguist', 0.44948018)]

Feminine version of "programmer":

In [0]:
new_vec = vecs[wordidx["programmer"]] - vecs[wordidx["he"]] + vecs[wordidx["she"]]

In [0]:
distances, indices = neigh.kneighbors([new_vec])

In [0]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('programmer', 0.1950342),
 ('stylist', 0.42715955),
 ('animator', 0.4820645),
 ('programmers', 0.48337305),
 ('choreographer', 0.4862678),
 ('technician', 0.48628056),
 ('designer', 0.48710012),
 ('prodigy', 0.49118334),
 ('lets', 0.49730027),
 ('screenwriter', 0.49754214)]

Masculine version of "programmer":

In [0]:
new_vec = vecs[wordidx["programmer"]] - vecs[wordidx["she"]] + vecs[wordidx["he"]]

In [0]:
distances, indices = neigh.kneighbors([new_vec])

In [0]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('programmer', 0.17419636),
 ('programmers', 0.4133587),
 ('engineer', 0.46376413),
 ('compiler', 0.467317),
 ('software', 0.4681465),
 ('animator', 0.4892366),
 ('computer', 0.5046158),
 ('mechanic', 0.5150068),
 ('setup', 0.51882535),
 ('developer', 0.51953185)]

In [0]:
distances, indices = neigh.kneighbors([vecs[wordidx["doctor"]]])

In [0]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('doctor', 5.9604645e-08),
 ('physician', 0.23267603),
 ('nurse', 0.24784917),
 ('dr.', 0.28248066),
 ('doctors', 0.29191148),
 ('patient', 0.29258162),
 ('medical', 0.30040073),
 ('surgeon', 0.30946612),
 ('hospital', 0.30990708),
 ('psychiatrist', 0.3410902)]

Feminine version of doctor:

In [0]:
new_vec = vecs[wordidx["doctor"]] - vecs[wordidx["he"]] + vecs[wordidx["she"]]

In [0]:
distances, indices = neigh.kneighbors([new_vec])

In [0]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('doctor', 0.13456273),
 ('nurse', 0.22582495),
 ('mother', 0.27610385),
 ('woman', 0.29901665),
 ('pregnant', 0.32096934),
 ('girl', 0.3324105),
 ('patient', 0.34357917),
 ('she', 0.35723114),
 ('child', 0.3631252),
 ('herself', 0.36338794)]

Masculine version of doctor:

In [0]:
new_vec = vecs[wordidx["doctor"]] - vecs[wordidx["she"]] + vecs[wordidx["he"]]

In [0]:
distances, indices = neigh.kneighbors([new_vec])

In [0]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('doctor', 0.15277696),
 ('physician', 0.27226865),
 ('medical', 0.37674332),
 ('he', 0.37695646),
 ('doctors', 0.38290107),
 ('dr.', 0.38466895),
 ('surgeon', 0.39124882),
 ('him', 0.40270942),
 ('hospital', 0.42226428),
 ('himself', 0.42476082)]

## Bias

Again, just looking at individual words is a **noisy** approach (I'm using it as a simple illustration).  [Researchers from Princeton and the University of Bath](https://www.princeton.edu/~aylinc/papers/caliskan-islam_semantics.pdf) use **small baskets of terms** to represent concepts.  They first confirmed that flowers are more pleasant than insects, and musical instruments are more pleasant than weapons.

They then found that European American names are "more pleasant" than African American names, as captured by how close the word vectors are (as embedded by GloVe, a word embedding method from Stanford along the same lines as Word2Vec).

    We show for the first time that if AI is to exploit via our language the vast 
    knowledge that culture has compiled, it will inevitably inherit human-like 
    prejudices. In other words, if AI learns enough about the properties of language 
    to be able to understand and produce it, it also acquires cultural associations 
    that can be offensive, objectionable, or harmful.
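
To give a feel for what comparing baskets of terms can look like, here is a simplified sketch in the spirit of that approach. It is not the paper's exact test statistic: it only computes, for each target word, its mean cosine similarity to one attribute basket minus its mean similarity to the other, and then compares two target baskets. All vectors and names below are made up for illustration:

```python
import numpy as np

def cos_sim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, basket_a, basket_b):
    """Mean similarity of w to basket A minus mean similarity to basket B."""
    return (np.mean([cos_sim(w, a) for a in basket_a])
            - np.mean([cos_sim(w, b) for b in basket_b]))

def basket_difference(targets_x, targets_y, basket_a, basket_b):
    """Difference in mean association between two sets of target words."""
    x_scores = [association(x, basket_a, basket_b) for x in targets_x]
    y_scores = [association(y, basket_a, basket_b) for y in targets_y]
    return float(np.mean(x_scores) - np.mean(y_scores))

# Made-up 2-d vectors standing in for real embeddings.
pleasant = np.array([[1.0, 0.1], [0.9, 0.2]])
unpleasant = np.array([[0.1, 1.0], [0.2, 0.9]])
flowers = np.array([[0.8, 0.3], [1.0, 0.2]])
insects = np.array([[0.3, 0.8], [0.2, 1.0]])

score = basket_difference(flowers, insects, pleasant, unpleasant)
print(score)  # positive: the toy "flowers" sit nearer the toy "pleasant" basket
```

Averaging over several words on each side is what makes this less noisy than the single-pair comparisons earlier in the demo.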

[Researchers from Boston University and Microsoft Research](https://arxiv.org/pdf/1606.06121.pdf) found the pairs most analogous to *He : She*.  They found gender bias, and also proposed a way to debias the vectors.
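
One ingredient of this kind of debiasing is a projection: estimate a "gender direction" and remove each word's component along it. The sketch below shows only that projection step on made-up vectors, using a single he/she pair to define the direction. The paper's full method involves more than this (for example, how the direction is estimated and which words get adjusted), so treat the helper neutralize as an illustrative assumption:

```python
import numpy as np

def neutralize(vec, direction):
    """Remove the component of `vec` that lies along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return vec - np.dot(vec, unit) * unit

# Made-up 3-d vectors, purely illustrative.
he = np.array([0.0, 1.0, 0.2])
she = np.array([0.0, -1.0, 0.2])
programmer = np.array([0.9, 0.4, 0.1])

gender_direction = he - she
debiased = neutralize(programmer, gender_direction)
print(debiased, np.dot(debiased, gender_direction))
```

After the projection, the toy "programmer" vector has no component along the toy gender direction, so it is equally similar to "he" and "she" along that axis.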

Rob Speer, CTO of Luminoso, tested for ethnic bias by finding correlations for a list of positive and negative words:

    The tests I implemented for ethnic bias are to take a list of words, such as 
    “white”, “black”, “Asian”, and “Hispanic”, and find which one has the strongest 
    correlation with each of a list of positive and negative words, such as “cheap”, 
    “criminal”, “elegant”, and “genius”. I did this again with a fine-grained version 
    that lists hundreds of words for ethnicities and nationalities, and thus is more 
    difficult to get a low score on, and again with what may be the trickiest test of 
    all, comparing words for different religions and spiritual beliefs.

**Ways to address bias**

There are a few different approaches:

- Debias word embeddings
  - [Technique in Bolukbasi, et al.](https://arxiv.org/abs/1606.06121)
  - [ConceptNet Numberbatch (Rob Speer)](https://blog.conceptnet.io/2017/04/24/conceptnet-numberbatch-17-04-better-less-stereotyped-word-vectors/)
- Argument that “awareness is better than blindness”: debiasing should happen at time of action, not at perception. ([Caliskan-Islam, Bryson, Narayanan](https://www.princeton.edu/~aylinc/papers/caliskan-islam_semantics.pdf))

Either way, you need to be on the lookout for bias and have a plan to address it!

If you are interested in the topic of bias in AI, I gave a workshop [you can watch here](https://www.youtube.com/watch?v=25nC0n9ERq4) that covers this material and goes into more depth about bias.

This demo has been adapted (and simplified) from part of Lesson 5 of [Practical Deep Learning for Coders](http://course.fast.ai/index.html).