<center><h1>Chapter 7 - Word vectors and word embeddings</h1></center>

First, I want to again give a shoutout to Allison Parrish (http://www.decontextualize.com/). She mixes machine learning with poetry (!) I really like her work and explanations. I borrowed her color mixing example.

The topic I want to take up this week is something called "embeddings". The general idea is to take some complex structure and reduce it to a vector encoding. The goal is for the vector to be low-dimensional and compact (non-sparse): its length should be in the hundreds, not the thousands. Most importantly, the vector captures "features" of the structure in such a way that structures that are similar have similar vectors.

Our interest is in text, so we will look at how to "embed" text into a vector. But the applications of embedding are much broader. In our department, for example, researchers are attempting to find embeddings for social networks: ways to encode a complex graph structure into a vector. I have seen something similar in biology: taking biological networks and reducing them to a vector representation. In fact, looking at your majors, I believe I can find researchers in each looking at embeddings.

One of the attributes of embeddings is that we can use similarity functions on the vectors. Thus we can find social networks that are similar, biological networks that are similar, students that are similar. In this chapter we will see this with words that are similar.

#1. Animal similarity

I'm going to kind of sneak up on the word embedding idea.
We'll begin by considering a small subset of English: words for animals. Our task will be to find similarities among these words and the creatures they designate. So we will look for similarities between `kitten` and `hamster`. I will come up with 2 features that I think are important for animals. See below:

![Animal spreadsheet](http://static.decontextualize.com/snaps/animal-spreadsheet.png)

This spreadsheet associates a handful of animals with two numbers: their cuteness and their size, both in a range from zero to one hundred. The values themselves are simply based on personal judgment. Your taste in cuteness and evaluation of size may differ significantly from mine. As with all data, these data are simply a mirror reflection of the person who collected them. In other words, you always have to be aware of human bias in problems like this.

This is a step toward what is called a word-vector. For instance, the word `kitten` has the 2d vector `[95, 15]`: cuteness first, then size.

Let's build the pandas version of the table shown. I'm going to use a pandas method that will convert a list of row values into a table.


In [None]:
rows = [
    ['kitten', 95,15],
    ['hamster', 80,8],
    ['tarantula', 8,3],
    ['puppy', 90, 20],
    ['crocodile', 5, 40],
    ['dolphin', 60,45],
    ['panda bear', 75, 40],
    ['lobster', 2, 15],
    ['capybara', 70, 30],
    ['elephant', 65, 90],
    ['mosquito', 1, 1],
    ['goldfish', 25, 2],
    ['horse', 50, 50],
    ['chicken', 25, 15]
]

In [None]:
columns = ['animal', 'cuteness', 'size']

In [None]:
import pandas as pd
animal_table = pd.DataFrame.from_records(rows, columns=columns)  #provide rows and column names
animal_table = animal_table.set_index(['animal'])  #make animal column the index
animal_table

animal,cuteness,size
kitten,95,15
hamster,80,8
tarantula,8,3
puppy,90,20
crocodile,5,40
dolphin,60,45
panda bear,75,40
lobster,2,15
capybara,70,30
elephant,65,90



The values in the table give us a way to make determinations about which animals are similar. For example, try to answer the following question:

 *Which animal is most similar to a `capybara`?*
 
  You could go through the values one by one and use Euclidean distance (or cosine similarity) to make that evaluation. This is quite similar to what we were doing with `ordered_distances`.

Let's try visualizing the data as points in 2-dimensional space:

![Animal space](http://static.decontextualize.com/snaps/animal-space.png)


##Bring in puddles now

We need to use Euclidean distance.

In [None]:
#flush the old directory
!rm -r  'uo_puddles'

rm: cannot remove 'uo_puddles': No such file or directory


In [None]:
my_github_name = 'uo-puddles'  #replace with your account name

In [None]:
#clone_url = f'https://github.com/{my_github_name}/w20_ds_library.git'
clone_url = f'https://github.com/{my_github_name}/uo_puddles.git'

In [None]:
#get the latest.
!git clone $clone_url 


Cloning into 'uo_puddles'...
remote: Enumerating objects: 231, done.
remote: Counting objects: 100% (231/231), done.
remote: Compressing objects: 100% (195/195), done.
remote: Total 231 (delta 137), reused 64 (delta 33), pack-reused 0
Receiving objects: 100% (231/231), 58.17 KiB | 6.46 MiB/s, done.
Resolving deltas: 100% (137/137), done.


In [None]:
import uo_puddles.uo_puddles as up

It looks to me like capybara is closest to panda. But given a graph representation, I can actually put a number on that "closeness".


In [None]:
up.euclidean_distance([70, 30], [75, 40]) # capybara and panda bear  11.180339887498949

11.180339887498949

Looking again at the graph, "tarantula" and "elephant" look far away. Again, we can put a number on this.

In [None]:
up.euclidean_distance([8, 3], [65, 90]) # tarantula and elephant  104.0096149401583

104.0096149401583
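
If you are curious what is going on inside `up.euclidean_distance`, here is a minimal sketch of the standard Euclidean distance formula. I am assuming the library function does the equivalent; `euclidean_distance_sketch` is my own made-up name and is not part of `uo_puddles`.

```python
import math

def euclidean_distance_sketch(a, b):
    """Straight-line distance between two equal-length vectors (illustrative helper)."""
    assert len(a) == len(b), "vectors must be the same length"
    # square each coordinate difference, add them up, take the square root
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

capybara_to_panda = euclidean_distance_sketch([70, 30], [75, 40])
tarantula_to_elephant = euclidean_distance_sketch([8, 3], [65, 90])
print(capybara_to_panda, tarantula_to_elephant)
```

The two printed values should agree with the two cells above, at least to many decimal places.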

Modeling animals in this way has interesting properties. For example, you can pick an arbitrary point in "animal space" and then find the animal closest to that point. If you imagine an animal of size 25 and cuteness 30, you can easily look at the space to find the animal that most closely fits that description: the chicken.
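
Here is a rough sketch of that "pick a point, find the closest animal" idea using nothing but plain Python. The dictionary repeats the table values, and `nearest_animal` is a name I made up for illustration. Note that each vector is `[cuteness, size]`, so size 25 and cuteness 30 becomes `[30, 25]`.

```python
import math

# [cuteness, size] values copied from the animal table
animals = {
    'kitten': [95, 15], 'hamster': [80, 8], 'tarantula': [8, 3],
    'puppy': [90, 20], 'crocodile': [5, 40], 'dolphin': [60, 45],
    'panda bear': [75, 40], 'lobster': [2, 15], 'capybara': [70, 30],
    'elephant': [65, 90], 'mosquito': [1, 1], 'goldfish': [25, 2],
    'horse': [50, 50], 'chicken': [25, 15],
}

def nearest_animal(point, table):
    """Return the name whose vector has the smallest Euclidean distance to point."""
    best_name, best_distance = None, float('inf')
    for name, vector in table.items():
        d = math.sqrt(sum((p - v) ** 2 for p, v in zip(point, vector)))
        if d < best_distance:  # keep the closest animal seen so far
            best_name, best_distance = name, d
    return best_name

imagined = [30, 25]  # cuteness 30, size 25
closest = nearest_animal(imagined, animals)
print(closest)
```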



I am going to write a special function to work with the animal table. I could probably rework my existing `ordered_distances` function, but decided it was easier to write this new one.

In [None]:
def ordered_embeddings(target_vector, table):
  names = table.index.tolist()
  ordered_list = []
  for i in range(len(names)):
    name = names[i]
    row = table.loc[name].tolist()
    d = up.euclidean_distance(target_vector, row)
    ordered_list.append([d, names[i]])
  ordered_list = sorted(ordered_list)

  return ordered_list

In [None]:
pup = animal_table.loc['puppy'].tolist()

ordered_embeddings(pup, animal_table)

[[0.0, 'puppy'],
 [7.0710678118654755, 'kitten'],
 [15.620499351813308, 'hamster'],
 [22.360679774997898, 'capybara'],
 [25.0, 'panda bear'],
 [39.05124837953327, 'dolphin'],
 [50.0, 'horse'],
 [65.19202405202648, 'chicken'],
 [67.446274915669, 'goldfish'],
 [74.33034373659252, 'elephant'],
 [83.74365647617735, 'tarantula'],
 [87.32124598286491, 'crocodile'],
 [88.14193099768123, 'lobster'],
 [91.00549433962765, 'mosquito']]

Let's look at it geometrically. You can answer questions like: what's halfway between a chicken and an elephant? Simply draw a line from "elephant" to "chicken," mark off the midpoint and find the closest animal. (According to our chart, halfway between an elephant and a chicken is a horse.) Let's check that out computationally.



In [None]:
elephant = animal_table.loc['elephant'].tolist()
chicken = animal_table.loc['chicken'].tolist()
half_vector = [(e+c)/2  for e,c in zip(elephant, chicken)]  #using fancy list-building version
half_vector  #[45.0, 52.5]

[45.0, 52.5]

Sorry, I am using a fancy version of my new list from old list gist. I could have written a loop, but wanted a one-liner.
<pre>
half_vector = [(e+c)/2  for e,c in zip(elephant, chicken)]
</pre>
I would not worry too much about it for now. I am going to ask you to write a function that does the same in just a bit.
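
In case the one-liner feels like magic, here is a sketch of the same computation written as a plain loop. The two vectors are typed in by hand from the animal table, and `half_vector_loop` is just an illustrative name.

```python
elephant = [65, 90]  # [cuteness, size] from the animal table
chicken = [25, 15]

# loop version of: [(e+c)/2 for e,c in zip(elephant, chicken)]
half_vector_loop = []
for i in range(len(elephant)):
    e = elephant[i]
    c = chicken[i]
    half_vector_loop.append((e + c) / 2)  # midpoint of this coordinate

print(half_vector_loop)
```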

In [None]:
ordered_embeddings(half_vector, animal_table)

[[5.5901699437494745, 'horse'],
 [16.77050983124842, 'dolphin'],
 [32.5, 'panda bear'],
 [33.63406011768428, 'capybara'],
 [41.907636535600524, 'crocodile'],
 [42.5, 'chicken'],
 [42.5, 'elephant'],
 [54.31620384378864, 'goldfish'],
 [55.509008277936296, 'puppy'],
 [56.61492736019362, 'hamster'],
 [57.05479822065801, 'lobster'],
 [61.80008090609591, 'tarantula'],
 [62.5, 'kitten'],
 [67.73662229547618, 'mosquito']]

You can also ask: what's the *difference* between a hamster and a tarantula? According to our plot, it's about seventy-two units of cute (and five units of size).

The relationship of "difference" is an interesting one, because it allows us to reason about *analogous* relationships. In the chart below, I've drawn an arrow from "tarantula" to "hamster" (in blue):

![Animal analogy](http://static.decontextualize.com/snaps/animal-space-analogy.png)

You can understand this arrow as being the *relationship* between a tarantula and a hamster, in terms of their size and cuteness (i.e., hamsters and tarantulas are about the same size, but hamsters are much cuter). In the same diagram, I've also transposed this same arrow (this time in red) so that its origin point is "chicken." The arrow ends closest to "kitten." What we've discovered is that the animal that is about the same size as a chicken but much cuter is... a kitten. To put it in terms of an analogy:

    Tarantulas are to hamsters as chickens are to kittens.
    


In [None]:
tarantula = animal_table.loc['tarantula'].tolist()
hamster = animal_table.loc['hamster'].tolist()
thd = up.euclidean_distance(tarantula, hamster)  #tarantula hamster distance
thd  #72.17340230306452

72.17340230306452

Now get animal distances from chicken and find the one closest to 72.17.

In [None]:
chick_ds = ordered_embeddings(chicken, animal_table)
chick_ds

[[0.0, 'chicken'],
 [13.0, 'goldfish'],
 [20.808652046684813, 'tarantula'],
 [23.0, 'lobster'],
 [27.784887978899608, 'mosquito'],
 [32.01562118716424, 'crocodile'],
 [43.01162633521314, 'horse'],
 [46.09772228646444, 'dolphin'],
 [47.43416490252569, 'capybara'],
 [55.44366510251645, 'hamster'],
 [55.90169943749474, 'panda bear'],
 [65.19202405202648, 'puppy'],
 [70.0, 'kitten'],
 [85.0, 'elephant']]

In [None]:
min(chick_ds, key=lambda x:abs(x[0]-thd))  #[70.0, 'kitten']

[70.0, 'kitten']

The above code looks kind of complex. Normally you would give the min function a list of values and it would find the smallest. But you can also give it a 2nd, optional argument: how you want the comparison to be made. So you have:

<pre>
key=lambda x:abs(x[0]-thd)
</pre>

In words, we want the minimum value when looking at the distance (i.e., x[0]) minus 72.17 (i.e., thd). And by the way, take the absolute value of that difference. So this should find the pair whose distance from chicken is closest to 72.17.
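
A tiny sketch may help here. The distances below are made up for illustration; the point is only to compare plain `min` with `min` plus a `key`.

```python
distances = [13.0, 55.4, 70.0, 85.0]  # made-up distances for illustration
target = 72.17

plain_min = min(distances)  # smallest value overall
# with a key, min compares the key results instead: here, the gap to the target
closest_to_target = min(distances, key=lambda d: abs(d - target))

print(plain_min, closest_to_target)
```

Notice that `min` still hands back the original item (70.0), not the gap that was used for comparing.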

Kind of a brief explanation, right? But I am hoping you will just go with it for now. I won't expect you to come up with this kind of code in your own programs. We could use a loop to do it, but it would be kind of complex. Check it out.


In [None]:
#loop equiv of min(chick_ds, key=lambda x:abs(x[0]-thd))  #[70.0, 'kitten']

d1, n1 = chick_ds[1]  #just to get us started
the_min = [abs(d1-thd), 1]
for i in range(2,len(chick_ds)):
  d,n = chick_ds[i]
  chick_diff = abs(d-thd)
  if chick_diff < the_min[0]:
    the_min = [chick_diff, i]

the_winner = chick_ds[the_min[1]]
the_winner

[70.0, 'kitten']
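
One caveat: the cells above match on the *length* of the tarantula-to-hamster arrow only, so an animal about 72 units away from chicken in any direction would qualify. Here is a sketch that keeps the direction as well, by moving the whole arrow so it starts at chicken and then looking for the animal nearest the tip. The dictionary repeats the table values, and the helper names are my own.

```python
import math

# [cuteness, size] values copied from the animal table
animals = {
    'kitten': [95, 15], 'hamster': [80, 8], 'tarantula': [8, 3],
    'puppy': [90, 20], 'crocodile': [5, 40], 'dolphin': [60, 45],
    'panda bear': [75, 40], 'lobster': [2, 15], 'capybara': [70, 30],
    'elephant': [65, 90], 'mosquito': [1, 1], 'goldfish': [25, 2],
    'horse': [50, 50], 'chicken': [25, 15],
}

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

tarantula, hamster, chicken = animals['tarantula'], animals['hamster'], animals['chicken']

# the blue arrow: what you add to tarantula to reach hamster
arrow = [h - t for h, t in zip(hamster, tarantula)]
# the red arrow: the same arrow, but starting at chicken
arrow_tip = [c + a for c, a in zip(chicken, arrow)]

# rank every animal by how close it sits to the tip of the red arrow
ranked = sorted(animals, key=lambda name: distance(arrow_tip, animals[name]))
print(arrow, arrow_tip, ranked[:3])
```

With this particular table both approaches land on kitten, but with other data they could disagree.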

#2. Language with vectors: colors

So far, so good. We have a system in place—albeit highly subjective—for talking about animals and the words used to name them. I want to talk about another vector space that has to do with language: the vector space of colors.

Colors are often represented in computers as vectors with three dimensions: red, green, and blue. Just as with the animals in the previous section, we can use these vectors to answer questions like: which colors are similar? What's the most likely color name for an arbitrarily chosen set of values for red, green and blue? Given the names of two colors, what's the name of those colors' "average"?

We'll be working with this [color data](https://github.com/dariusk/corpora/blob/master/data/colors/xkcd.json) from the [xkcd color survey](https://blog.xkcd.com/2010/05/03/color-survey-results/). The data relates a color name to the RGB value associated with that color. [Here's a page that shows what the colors look like](https://xkcd.com/color/rgb/).

I've put it in a table for us.

In [None]:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vT7WNqkgfIYL5AWgb8aGSGhvh3wo-JwTQlJzN1Y2LYH09fzLtfeKHMDau9s6PcOBwU01-DfbPuEzhTZ/pub?output=csv'

In [None]:
color_table = pd.read_csv(url, encoding='utf-8', dtype={'color':str}, na_filter=False)
color_table = color_table.set_index(['color'])


In [None]:
color_table.head()

color,hex,red,green,blue
acid green,#8ffe09,143,254,9
adobe,#bd6c48,189,108,72
algae,#54ac68,84,172,104
algae green,#21c36f,33,195,111
almost black,#070d0d,7,13,13


The following function converts colors from hex format (`#1a2b3c`) to a list of integers:

In [None]:
def hex_to_int(s:str) -> list:
  assert isinstance(s, str), f's must be a string but is instead a {type(s)}'
  assert len(s) == 7, f's must be 7 characters long but is instead {len(s)}'
  assert s[0] == '#', f's must start with a # but instead starts with an {s[0]}'

  s = s.lstrip("#")  #strip # off of left-hand side of string
  red = int(s[:2], 16)
  green = int(s[2:4], 16)
  blue = int(s[4:6], 16)
  return [red, green, blue]

I added type hints and asserts to help you remember what the function expects as input.

In [None]:
hex_to_int('#8ffe09')  #[143, 254, 9] - matches what we see in table

[143, 254, 9]

##What the heck is "hex"?

Computer scientists (and color mixers!) like to play around with different number bases. We all know base 10, which uses the digits 0-9. Well, hex (hexadecimal) is base 16. It uses the digits 0-9 followed by the letters a-f, which stand for 10-15. I suppose if we had to work with aliens that had 8 fingers per hand, we would have to get good at hex arithmetic. Under the hood the computer stores numbers in binary, and hex is a compact way to write binary out: each hex digit stands for exactly 4 bits. The computer typically does the translation to decimal for you. But not always :)

Python has a function for converting a hex number (as a string) into a decimal equivalent. Check it out.

In [None]:
int('8f', 16)  #give decimal equiv of 8f in base 16, i.e., hex

143

We can go the other way. You can ignore the 0x prefix.

In [None]:
hex(143)  #give hex version of decimal 143

'0x8f'
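
Going one step further, here is a sketch of the reverse of `hex_to_int`: turning a list of three integers back into a hex color string. The function name `int_to_hex` is my own invention and is not part of the notebook's library.

```python
def int_to_hex(rgb):
    """Hypothetical inverse of hex_to_int: [143, 254, 9] -> '#8ffe09'."""
    assert len(rgb) == 3, "expected [red, green, blue]"
    assert all(0 <= v <= 255 for v in rgb), "each channel must be in 0..255"
    # '02x' means: two hex digits, padded with a leading zero, lowercase
    return '#' + ''.join(format(v, '02x') for v in rgb)

acid_green = int_to_hex([143, 254, 9])
almost_black = int_to_hex([7, 13, 13])
print(acid_green, almost_black)
```

The zero padding matters: without it, 7 would come out as `7` instead of `07` and the string would be too short.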

We won't be using hex going forward but thought you might like to know what it is, at least at a high level.

##Check out cosine differences

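Reminder: cosine similarity ranges from -1 to 1 in general, with 1 being an exact match in direction. RGB values are never negative, so for colors it ranges from 0 to 1.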

##Wrangling: drop hex column

We don't need it. We already have the equivalent red, green, and blue values. And btw, these are in decimal.

In [None]:
color_table = color_table.drop(['hex'], axis=1) #need axis=1 to say we are dropping a column
color_table.head()

color,red,green,blue
acid green,143,254,9
adobe,189,108,72
algae,84,172,104
algae green,33,195,111
almost black,7,13,13


In [None]:
olive_vector = color_table.loc['olive'].tolist()
olive_vector


[110, 117, 14]

In [None]:
red_vector = color_table.loc['red'].tolist()
red_vector


[229, 0, 0]

In [None]:
up.cosine_similarity(olive_vector, red_vector)  #0.6823879113063314

0.6823879113063314
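
For the curious, here is a minimal sketch of the standard cosine similarity formula: the dot product divided by the product of the two vector lengths. I am assuming `up.cosine_similarity` computes the equivalent; `cosine_similarity_sketch` is an illustrative name of my own.

```python
import math

def cosine_similarity_sketch(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))  # length of a
    norm_b = math.sqrt(sum(y * y for y in b))  # length of b
    return dot / (norm_a * norm_b)

olive = [110, 117, 14]
red = [229, 0, 0]
olive_red = cosine_similarity_sketch(olive, red)
# a dimmer red points in the same direction as red, just not as far
red_vs_dim_red = cosine_similarity_sketch(red, [100, 0, 0])
print(olive_red, red_vs_dim_red)
```

The second result shows a design choice worth knowing about: cosine similarity ignores vector length, so a dim red and a bright red count as a perfect match, while Euclidean distance would keep them apart.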

##Closest colors

In [None]:
ordered_embeddings(red_vector, color_table)[:10]  #closest colors to red

[[0.0, 'red'],
 [25.079872407968907, 'fire engine red'],
 [29.068883707497267, 'bright red'],
 [45.552167895721496, 'tomato red'],
 [45.73838650411709, 'cherry red'],
 [46.33573135281238, 'scarlet'],
 [53.563046963368315, 'vermillion'],
 [56.2672195865408, 'orangish red'],
 [56.49778756730214, 'cherry'],
 [59.84981202978001, 'lipstick red']]

#Assignment 1
<img src='https://www.dropbox.com/s/3uyvp722kp5to2r/assignment.png?raw=1' width='300'>

I'd like to start messing around with "color arithmetic". I think you will see it is kind of cool. I'll ask you to answer questions like this:

* What color do you get by subtracting "red" from "purple"?

* What's blue plus green?

* What's the average of black and white?

* An analogy: pink is to red as X is to blue.

* Another analogy: Navy is to blue as X is to green.

I am going to ask you to write a collection of functions to help answer these questions. I'll give you the first below. I am going to include type hints and asserts. **YOU DO NOT NEED TO INCLUDE THESE IN YOUR FUNCTIONS.** They can be intimidating when just getting your feet wet. But feel free to try your hand if you are feeling adventurous.

Here you go. A function for subtracting 2 vectors of any length as long as they are of equal length.

You can also see that I am giving you the one-line version of the function body as a comment. You can ignore it, no problem. But if you can get your head around it, I think it is a safer approach. The fewer lines of code you have to write, the fewer places there are to introduce bugs.

In [None]:
def subtractv(x:list, y:list) -> list:
  assert isinstance(x, list), f"x must be a list but instead is {type(x)}"
  assert isinstance(y, list), f"y must be a list but instead is {type(y)}"
  assert len(x) == len(y), f"x and y must be the same length"

  #result = [(c1 - c2) for c1, c2 in zip(x, y)]  #one-line compact version - called a list comprehension

  result = []
  for i in range(len(x)):
    c1 = x[i]
    c2 = y[i]
    result.append(c1-c2)

  return result

In [None]:
subtractv([5, 10, 20],[1, 2, 3])  #[4, 8, 17]

[4, 8, 17]
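
A side note on why the length assert earns its keep. Here is a small sketch, with made-up lists, of what the two versions of the function body would do if the assert were missing and the lists had different lengths.

```python
x = [5, 10, 20]
y_short = [1, 2]  # one element missing on purpose

# zip stops at the shorter list, so the last element of x is silently dropped
truncated = [c1 - c2 for c1, c2 in zip(x, y_short)]
print(truncated)

# the index-based loop fails loudly instead
try:
    for i in range(len(x)):
        x[i] - y_short[i]
    loop_failed = False
except IndexError:
    loop_failed = True
print(loop_failed)
```

So the compact version gives a wrong answer without complaint, while the loop raises an error. The assert makes both versions fail early with a clear message.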

##Step 1.

Please define addv. Use subtractv as a template.

In [None]:
def addv(x:list, y:list) -> list:
  assert isinstance(x, list), f"x must be a list but instead is {type(x)}"
  assert isinstance(y, list), f"y must be a list but instead is {type(y)}"
  assert len(x) == len(y), f"x and y must be the same length"

  #result = [c1 + c2 for c1, c2 in zip(x, y)]  #one-line compact version

  result = []
  for i in range(len(x)):
    c1 = x[i]
    c2 = y[i]
    result.append(c1+c2)

  return result

In [None]:
addv([5, 10, 20],[1, 2, 3])  #[6, 12, 23]

[6, 12, 23]

##Step 2.

Please define dividev. This function takes a list and a number and divides every element of the list by the number.

In [None]:
def dividev(x:list, c) -> list:
  assert isinstance(x, list), f"x must be a list but instead is {type(x)}"
  assert isinstance(c, int) or isinstance(c, float), f"c must be an int or a float but instead is {type(c)}"

  #result = [v/c for v in x]  #one-line compact version

  result = []
  for i in range(len(x)):
    v = x[i]
    result.append(v/c) #division produces a float

  return result

In [None]:
dividev([2, 10, 20], 2)  #[1.0, 5.0, 10.0]

[1.0, 5.0, 10.0]

##Step 3.

This one is more challenging. I would like the mean vector from a matrix. As a reminder, a matrix is a list of vectors. So add all the vectors up, then divide each element by the length of the matrix (the number of rows). Please use `addv` and `dividev` in your function body.

In [None]:
def meanv(matrix: list) -> list:
    assert isinstance(matrix, list), f"matrix must be a list but instead is {type(matrix)}"
    assert len(matrix) >= 1, f'matrix must have at least one row'

    #Python transpose: sumv = [sum(col) for col in zip(*matrix)]

    sumv = matrix[0]  #use first row as starting point in "reduction" style
    for row in matrix[1:]:   #make sure start at row index 1 and not 0
      sumv = addv(sumv, row)
    mean = dividev(sumv, len(matrix))
    return mean


In [None]:
test = [[0, 1], [2, 2], [4, 3]]  #test matrix

In [None]:
meanv(test)  #[2.0, 2.0]

[2.0, 2.0]

A little more on the fancy version. Skip if you like.

The form `zip(*matrix)` is a gist that stands for matrix transpose. I'll show you by transposing the matrix `test` from a `3x2` to a `2x3`.

In [None]:
testz = list(zip(*test))  #transforms into a 2x3 matrix.
testz  #look at test and see what has happened.

[(0, 2, 4), (1, 2, 3)]

In [None]:
the_sum = [sum(row) for row in testz]  #can use plain sum because each column is now a row
the_sum

[6, 6]

In [None]:
dividev(the_sum, len(test))  #matches

[2.0, 2.0]

One more way. Thanks to Tara for googling around for this. She took note of my mention of the "reduce" approach and found a genuine `reduce` function.

It is interesting that `reduce` was a built-in function in Python 2 but was moved into the `functools` library in Python 3. Not sure why. I think it is useful.

In [None]:
from functools import reduce  #just want this one function from the library

the_sum = reduce(addv, test)  #add each of the pairs in test together using addv
the_sum

[6, 6]
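
To tie the approaches together, here is a sketch that computes the mean vector of `test` two ways, once with `reduce` and once with the transpose idiom. `addv_sketch` is a compact stand-in I wrote for `addv` so that the block runs on its own.

```python
from functools import reduce

def addv_sketch(x, y):
    """Compact stand-in for addv: element-wise sum of two equal-length lists."""
    return [a + b for a, b in zip(x, y)]

test = [[0, 1], [2, 2], [4, 3]]

# reduce: add the rows together pairwise, then divide each total by the row count
mean_by_reduce = [v / len(test) for v in reduce(addv_sketch, test)]
# transpose: turn columns into rows, then average each one
mean_by_transpose = [sum(col) / len(test) for col in zip(*test)]

print(mean_by_reduce, mean_by_transpose)
```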

##Step 4.

Please answer this question:

Find the ten colors closest to the average of 'black' and 'white'.



##Step 5.

Please answer this question:

Find the 10 colors closest to what you get by subtracting "red" from "purple".


##Step 6.

Please answer this question:


What are the 10 colors closest to blue plus green?



##Step 7.

Please answer this question:

An analogy: pink is to red as X is to blue. What are the 10 best colors to choose for X?

This one is a little trickier. Let me see if I can restate it. I am trying to solve this equation where the tilde stands for *roughly equal to*.

<pre>
(pink - red) ~ (X - blue)
 or
(pink - red) + blue ~ X
</pre>
What I get:
<pre>
[[163.29727493133498, 'neon blue'],
 [163.44418007380992, 'bright sky blue'],
 [170.0764533967004, 'bright light blue'],
 [172.97976760303501, 'cyan'],
 [174.54512310574592, 'bright cyan'],
 [176.39727889057698, 'bright turquoise'],
 [178.23860412379804, 'clear blue'],
 [178.54131174604942, 'azure'],
 [178.92456511055155, 'dodger blue'],
 [180.95303258028034, 'lightish blue']]
 </pre>

In [None]:
# an analogy: pink is to red as X is to blue
pink_to_red = subtractv(color_table.loc['pink'].tolist(), color_table.loc['red'].tolist())
addin = addv(pink_to_red, color_table.loc['blue'].tolist())
ordered_embeddings(addin, color_table)[:10]

[[163.29727493133498, 'neon blue'],
 [163.44418007380992, 'bright sky blue'],
 [170.0764533967004, 'bright light blue'],
 [172.97976760303501, 'cyan'],
 [174.54512310574592, 'bright cyan'],
 [176.39727889057698, 'bright turquoise'],
 [178.23860412379804, 'clear blue'],
 [178.54131174604942, 'azure'],
 [178.92456511055155, 'dodger blue'],
 [180.95303258028034, 'lightish blue']]

##Step 8.

Please answer this question:

Another analogy: Navy is to blue as X is to green. What are the 10 best colors to choose for X?

What I get:
<pre>
[[140.59160714637272, 'true green'],
 [143.85409274678284, 'dark grass green'],
 [147.770091696527, 'grassy green'],
 [148.82540105774956, 'racing green'],
 [151.07944929738127, 'forest'],
 [151.52887513606112, 'bottle green'],
 [153.4079528577316, 'dark olive green'],
 [153.6522046701576, 'darkgreen'],
 [154.042202009709, 'forrest green'],
 [154.52184311611094, 'grass green']]
 </pre>

In [None]:

navy_to_blue = subtractv(color_table.loc['navy'].tolist(), color_table.loc['blue'].tolist())
addin = addv(navy_to_blue, color_table.loc['green'].tolist())
ordered_embeddings(addin, color_table)[:10]

[[140.59160714637272, 'true green'],
 [143.85409274678284, 'dark grass green'],
 [147.770091696527, 'grassy green'],
 [148.82540105774956, 'racing green'],
 [151.07944929738127, 'forest'],
 [151.52887513606112, 'bottle green'],
 [153.4079528577316, 'dark olive green'],
 [153.6522046701576, 'darkgreen'],
 [154.042202009709, 'forrest green'],
 [154.52184311611094, 'grass green']]

##Step 9.

Please answer this question:

Throw all the colors together. Take the average of all the colors. What are the 10 colors closest to the average vector you get?

Here is what I get:
<pre>
[[4.602553484191201, 'brown grey'],
 [21.139113217848376, 'reddish grey'],
 [21.428001832576594, 'brownish grey'],
 [24.206525784721993, 'medium grey'],
 [25.59966041635289, 'green grey'],
 [26.10522478271917, 'warm grey'],
 [27.96545249675546, 'dark khaki'],
 [30.709594237593613, 'grey green'],
 [32.567254561942164, 'grey/green'],
 [33.166059500486384, 'greeny grey']]
 </pre>

In [None]:
matrix = []

for name in color_table.index.values.tolist():
  matrix.append(color_table.loc[name].tolist())

In [None]:
matrix[:5]

[[143, 254, 9], [189, 108, 72], [84, 172, 104], [33, 195, 111], [7, 13, 13]]

In [None]:
all_average = meanv(matrix)
all_average

[141.5690200210748, 134.3119072708114, 107.93888303477344]

In [None]:
ordered_embeddings(all_average, color_table)[:10]

[[4.602553484191201, 'brown grey'],
 [21.139113217848376, 'reddish grey'],
 [21.428001832576594, 'brownish grey'],
 [24.206525784721993, 'medium grey'],
 [25.59966041635289, 'green grey'],
 [26.10522478271917, 'warm grey'],
 [27.96545249675546, 'dark khaki'],
 [30.709594237593613, 'grey green'],
 [32.567254561942164, 'grey/green'],
 [33.166059500486384, 'greeny grey']]

##Bold statement

I claim that the above demonstrates that it's possible to use math to reason about how people use language.

#3. Doing bad digital humanities with color vectors

With the tools above in hand, we can start using our vectorized knowledge of language toward academic ends. In the following example, I'm going to calculate the average color of Bram Stoker's *Dracula*.

We will definitely need spacy so let's bring that in.

In [None]:
import spacy


The following code will tell spacy to use something called a Graphics Processing Unit (GPU) if it is available. Otherwise, work without it.

If you are so inclined, you can turn on Colab's GPU under Runtime/Change runtime type. The problem is that this will restart your kernel, so you will have to rerun all the cells above this one. Instead, you could leave yourself a note to change it when you first start the notebook up. The setting sticks, so you only need to do it once (per notebook).

What good is a GPU? It's kind of complicated. The short answer is that it *may* make your spacy code run faster. But you can also run without it, maybe just a tad slower.

In [None]:
spacy.prefer_gpu()  #True if have GPU turned on, False if you just want to run normally


True

In [None]:
!python -m spacy download en_core_web_md

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4MB)
     |████████████████████████████████| 96.4MB 1.3MB/s
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-cp36-none-any.whl size=98051305 sha256=be13e8d878382b3f8c1356f1f309a52efd414381c60e12fd3951de151ffd798c
  Stored in directory: /tmp/pip-ephem-wheel-cache-r9vdn19y/wheels/df/94/ad/f5cf59224cea6b5686ac4fd1ad19c8a07bc026e13c36502d81
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [None]:
import en_core_web_md
nlp = en_core_web_md.load()  #Gives us a way to parse text documents in one line of code. You will see in a minute.

In [None]:
#spnlp = TypeVar('spacy.lang.en.English')  #for type hints

In [None]:
nlp('This is a test sentence.')

This is a test sentence.

Let's do a little exploration of what we just loaded.

In [None]:
nlp.vocab.length  #1.3M words - think I told you this was 20K. Not sure where I got that idea!

1340241

Jargon alert. A "dunder" method (see below) is a method that starts and ends with 2 underscores. The "d" stands for "double", the "under" for underscores.

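Calling a dunder method by name looks semi-arcane. You rarely need to: Python's `in` operator calls `__contains__` behind the scenes, so `'marvelous' in nlp.vocab` asks the same question as the cell below.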

In [None]:
nlp.vocab.__contains__('marvelous')  #dunder method for checking if word in vocab

True

In [None]:
nlp.vocab.__contains__('askfds')

False
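
To demystify the dunder a bit, here is a sketch using a tiny made-up class. `TinyVocab` is purely hypothetical and has nothing to do with spacy's real vocabulary object; it only shows that Python's `in` operator calls `__contains__` for you.

```python
class TinyVocab:
    """Hypothetical stand-in for a vocabulary, just to show the mechanics."""
    def __init__(self, words):
        self.words = set(words)

    def __contains__(self, word):
        # Python calls this method whenever you write: word in vocab
        return word in self.words

vocab = TinyVocab(['marvelous', 'kitten'])
with_in = 'marvelous' in vocab                  # the everyday spelling
with_dunder = vocab.__contains__('marvelous')   # the same check, spelled out
missing = 'askfds' in vocab
print(with_in, with_dunder, missing)
```

Any object that defines `__contains__` can be used with `in` this way.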

#4. The color of books

Let's check out the "color" of some classic books. To calculate the average color of an entire book, we'll follow these steps:

1. Parse the book text into words using spacy's `nlp` method.
2. Check every word to see if it names a color in our vector space, i.e., the color_table. If it does, add it to a list of vectors.
3. Find the average of that list of vectors.
4. Find the color(s) closest to that average vector.
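
The four steps above can be sketched on a toy scale before we bring in a whole novel. Everything here is a stand-in: the sentence is made up, the table holds only four colors (with RGB values taken from the cells earlier in this chapter), and a plain `split` plays the role of spacy's parser.

```python
import math

# a four-color stand-in for color_table
toy_color_table = {
    'red': [229, 0, 0], 'olive': [110, 117, 14],
    'adobe': [189, 108, 72], 'algae': [84, 172, 104],
}
toy_text = 'The red door opened onto an olive grove under a red sky.'

# step 1: split the text into words (a crude stand-in for spacy's parser)
words = toy_text.lower().replace('.', ' ').split()

# step 2: collect the vector of every word that names a color
matrix = [toy_color_table[w] for w in words if w in toy_color_table]

# step 3: average the collected vectors, column by column
mean_color = [sum(col) / len(matrix) for col in zip(*matrix)]

# step 4: find the named color closest to that average
def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

closest = min(toy_color_table, key=lambda name: distance(mean_color, toy_color_table[name]))
print(len(matrix), mean_color, closest)
```

Two reds and one olive average out to a slightly muddied red, and red is the nearest name in the toy table.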

I'm going to set up links to 3 classic novels. All of the links point to text copies of books maintained by Project Gutenberg. If you have a favorite book that has outrun its copyright, you may find it at gutenberg.org.

In [None]:
dracula_url = 'http://www.gutenberg.org/cache/epub/345/pg345.txt'
dickens_url = 'https://www.gutenberg.org/files/98/98-0.txt'  #tale of two cities
yellow_url = 'http://www.gutenberg.org/files/1952/1952-0.txt'

These are text files and not csv files. We are going to have to go through a few more steps to load them into Python.

First we will use a command line operation (denoted by the bang that starts it). This will bring the file into temporary storage in colab. This does **not** bring it into Google Drive. And it won't stay in temporary storage forever.

In [None]:
!wget {dracula_url}

--2020-05-13 16:05:50--  http://www.gutenberg.org/cache/epub/345/pg345.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 883160 (862K) [text/plain]
Saving to: ‘pg345.txt’


2020-05-13 16:05:50 (2.69 MB/s) - ‘pg345.txt’ saved [883160/883160]



Check to make sure it is there using another command line operation, `ls -l`.

In [None]:
!ls -l

total 872
-rw-r--r-- 1 root root 883160 May  1 07:50 pg345.txt
drwxr-xr-x 1 root root   4096 May  4 16:26 sample_data
drwxr-xr-x 4 root root   4096 May 13 16:04 uo_puddles


Now we can read it into a string.

In [None]:
with open('pg345.txt', 'r') as f:
  dracula = f.read()

In [None]:
type(dracula)

str

In [None]:
dracula[:100]  #first 100 characters

'\ufeffThe Project Gutenberg EBook of Dracula, by Bram Stoker\n\nThis eBook is for the use of anyone anywher'
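
That odd `\ufeff` at the very start is a byte-order mark (BOM), an invisible marker some UTF-8 files begin with. It is harmless for what we are doing, but here is a sketch of how to drop it using Python's standard `utf-8-sig` encoding. The demo writes its own small temporary file (the file name is made up) so that it does not touch the book.

```python
import os
import tempfile

# write a small file that starts with a UTF-8 byte-order mark, like the Gutenberg file
path = os.path.join(tempfile.mkdtemp(), 'bom_demo.txt')
with open(path, 'wb') as f:
    f.write(b'\xef\xbb\xbfThe Project Gutenberg EBook')

with open(path, 'r', encoding='utf-8') as f:
    with_bom = f.read()       # keeps the mark as '\ufeff'
with open(path, 'r', encoding='utf-8-sig') as f:
    without_bom = f.read()    # strips the mark

print(repr(with_bom[:4]), repr(without_bom[:4]))
```

If you reread the book this way, expect the character count to be one smaller than the count shown in this notebook.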

In [None]:
len(dracula)  #867141 characters

867141

#5. Parse entire book

Up until now, we have been parsing individual sentences. But spacy will handle a document that consists of multiple sentences. That document can be an article from a medical journal, a web page (after wrangling) and even an entire novel.

In [None]:
doc = nlp(dracula.lower())

##We can go through doc the old fashioned way

We can just ask for each token, one after the other. That is what we will end up doing. But I wanted to show you that spacy also parses out sentences.

In [None]:
drac_sentences = list(doc.sents)  #We will use this later

In [None]:
for i in range(5):
  print(i,drac_sentences[i+60])  #starting at 60th sentence to get past boilerplate

0 _)



1 _3 may.
2 bistritz._--left munich at 8:35 p. m.
3 , on 1st may, arriving at
vienna early next morning; should have arrived at 6:46, but train was an
hour late.
4 buda-pesth seems a wonderful place, from the glimpse which i
got of it from the train and the little i could walk through the
streets.


In [None]:
sentence64 = drac_sentences[64]
sentence64

buda-pesth seems a wonderful place, from the glimpse which i
got of it from the train and the little i could walk through the
streets.

In [None]:
for token in sentence64:
  print(token.text)

buda
-
pesth
seems
a
wonderful
place
,
from
the
glimpse
which
i


got
of
it
from
the
train
and
the
little
i
could
walk
through
the


streets
.


#Now to build the matrix

General strategy: go through every token in `doc` and check if it is in the `color_table` index. If it is, add its RGB vector to your matrix. When done, get the average RGB vector.

I'm going to use my standard gist for this.

In [None]:
#the old list is doc

drac_color_matrix = []  #the new list
color_names = color_table.index.tolist()

for token in doc:
  word = token.text
  if word in color_names:
    drac_color_matrix.append(color_table.loc[word].tolist())  #append the rgb values




In [None]:
len(drac_color_matrix)  #901 uses of a color word

901
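
Before averaging, it can be handy to know *which* color words were found and how often. Here is a sketch with a made-up token list and a three-color stand-in for the table index. It also uses a set for the names, since checking membership in a set is faster than in a list. One caveat to keep in mind: tokens are single words, so a multi-word name like `acid green` can never match a token.

```python
from collections import Counter

# made-up tokens and color names, standing in for doc and color_table.index
toy_tokens = ['the', 'red', 'door', 'white', 'face', 'red', 'lips', 'black', 'night', 'red']
toy_color_names = {'red', 'white', 'black'}

# count each token that is also a color name
color_counts = Counter(t for t in toy_tokens if t in toy_color_names)
print(color_counts.most_common())
```

A word that shows up many times pulls the average color toward itself that many times.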

In [None]:
avg_color = meanv(drac_color_matrix)

In [None]:
avg_color  #array([147.44839068, 113.65371809, 100.13540511])

[147.44839067702551, 113.65371809100999, 100.13540510543841]

Now, we'll pass the averaged color vector to the ordering function, yielding a brown mush, which is kinda what you'd expect from adding a bunch of colors together willy-nilly.

In [None]:
ordered_embeddings(avg_color, color_table)[:10]

[[13.519858753013214, 'reddish grey'],
 [15.356247186381948, 'brownish grey'],
 [16.350106463486874, 'brownish'],
 [19.826822637698537, 'brown grey'],
 [21.824003657449868, 'mocha'],
 [26.730012587581818, 'grey brown'],
 [28.095953180857567, 'puce'],
 [28.286050911198767, 'dull brown'],
 [29.719493432987974, 'pinkish brown'],
 [31.643437130672552, 'dark taupe']]


On the other hand, here's what we get when we average the colors of Charlotte Perkins Gilman's classic *The Yellow Wallpaper*.  The result definitely reflects the content of the story, so maybe we're on to something here.


In [None]:
!wget {yellow_url}

--2020-05-13 16:06:07--  http://www.gutenberg.org/files/1952/1952-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51186 (50K) [text/plain]
Saving to: ‘1952-0.txt’


2020-05-13 16:06:07 (1.11 MB/s) - ‘1952-0.txt’ saved [51186/51186]



Check to make sure it is there using another command-line operation, `ls -l`.

In [None]:
!ls -l


total 924
-rw-r--r-- 1 root root  51186 Apr 18 14:40 1952-0.txt
-rw-r--r-- 1 root root 883160 May  1 07:50 pg345.txt
drwxr-xr-x 1 root root   4096 May  4 16:26 sample_data
drwxr-xr-x 4 root root   4096 May 13 16:04 uo_puddles


Now we can read it into a string.

In [None]:
with open('1952-0.txt', 'r') as f:
  yellow = f.read()

In [None]:
yellow[:100]  #first 100 characters

"Project Gutenberg's The Yellow Wallpaper, by Charlotte Perkins Gilman\n\nThis eBook is for the use of "

In [None]:
len(yellow)

50780

In [None]:
doc = nlp(yellow.lower())

In [None]:

yellow_color_matrix = []
color_names = color_table.index.tolist()

for token in doc:
  word = token.text
  if word in color_names:
    yellow_color_matrix.append(color_table.loc[word].tolist())




In [None]:
len(yellow_color_matrix)  #26 uses of a color word

26

In [None]:
avg_color = meanv(yellow_color_matrix)

In [None]:
avg_color  #[192.0, 185.26923076923077, 48.23076923076923]

[192.0, 185.26923076923077, 48.23076923076923]

In [None]:
ordered_embeddings(avg_color, color_table)[:10]

[[32.867606937512456, 'pea'],
 [34.6139529618472, 'puke yellow'],
 [35.25580652163731, 'sick green'],
 [37.701902232548726, 'vomit yellow'],
 [39.06073635071867, 'booger'],
 [39.34720654280166, 'olive yellow'],
 [40.548768942125655, 'snot'],
 [41.77193998863593, 'gross green'],
 [42.05000193486195, 'dirty yellow'],
 [42.420635957392086, 'mustard yellow']]


Definitely captures the yellowness (in kind of a gross way!).

#Assignment 2
<img src='https://www.dropbox.com/s/3uyvp722kp5to2r/assignment.png?raw=1' width='300'>

Write a function `build_embedding_matrix` that takes as parameters a string (e.g., a book) and a table (e.g., the color_table) and produces the matrix of values.

If you are thinking about using asserts, here is one that will check to see if a variable holds a pandas table:
<pre>
assert isinstance(table, pd.core.frame.DataFrame), f'table not a dataframe but instead a {type(table)}'
</pre>

In [None]:
def build_embedding_matrix(raw_text: str, table) -> list:
  assert isinstance(raw_text, str), f'raw_text should be string but instead is {type(raw_text)}'
  assert isinstance(table, pd.core.frame.DataFrame), f'table not a dataframe but instead a {type(table)}'
  assert 'nlp' in globals(), 'This function assumes that the spacy nlp function has been defined'

  matrix = []  #new list
  index_list = table.index.tolist() #pull out the color names into a list
  doc = nlp(raw_text.lower())  #old list

  #matrix = [table.loc[token.text].tolist() for token in doc if token.text in index_list]

  for token in doc:
    word = token.text
    if word in index_list:
      matrix.append(table.loc[word].tolist())
  return matrix

Test your function out.

In [None]:
yellowmat = build_embedding_matrix(yellow, color_table)

In [None]:
yellowmat == yellow_color_matrix  #check against what we did above - should be True

True

In [None]:
dracmat = build_embedding_matrix(dracula, color_table)

In [None]:
dracmat == drac_color_matrix  #check against what we did above - should be True

True

##Congrats!

You have created a set of functions that are kind of useful.

#Assignment 3
<img src='https://www.dropbox.com/s/3uyvp722kp5to2r/assignment.png?raw=1' width='300'>

Put your function to use. Find the average color of A Tale of Two Cities. Use your function at the appropriate spot.

Match my top 10:
<pre>
[[12.272728816373057, 'dark taupe'],
 [14.961457974171816, 'cocoa'],
 [16.458517047363525, 'greyish brown'],
 [17.920407343606808, 'dull brown'],
 [19.65895044742039, 'grey brown'],
 [21.21055110632527, 'dirt'],
 [24.630416790742295, 'dark mauve'],
 [24.877166111255395, 'dirt brown'],
 [29.25547441759158, 'brownish'],
 [29.602296007881918, 'brownish grey']]
 </pre>

In [None]:
!wget {dickens_url}

--2020-05-13 16:06:23--  https://www.gutenberg.org/files/98/98-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 804335 (785K) [text/plain]
Saving to: ‘98-0.txt’


2020-05-13 16:06:24 (2.19 MB/s) - ‘98-0.txt’ saved [804335/804335]



In [None]:
with open('98-0.txt', 'r') as f:
  dickens = f.read()

In [None]:
dickens_matrix = build_embedding_matrix(dickens, color_table)

In [None]:
dickens_mean = meanv(dickens_matrix)

In [None]:
ordered_embeddings(dickens_mean, color_table)[:10]

[[12.272728816373057, 'dark taupe'],
 [14.961457974171816, 'cocoa'],
 [16.458517047363525, 'greyish brown'],
 [17.920407343606808, 'dull brown'],
 [19.65895044742039, 'grey brown'],
 [21.21055110632527, 'dirt'],
 [24.630416790742295, 'dark mauve'],
 [24.877166111255395, 'dirt brown'],
 [29.25547441759158, 'brownish'],
 [29.602296007881918, 'brownish grey']]

#What have we learned?

One means of converting a sequence of words into a single vector is to take the average of all the individual word vectors. We are taking the entire set of color words in a book and averaging. But we can also do the same thing with smaller units, e.g., sentences, tweets.
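
Here is a miniature version of that pipeline, just to make the averaging step concrete. Everything in it is made up for illustration: `toy_colors` is a hypothetical four-color table (not the real color_table), `mean_vector` and `closest_name` are stand-ins I wrote for this sketch (not the `meanv` and `ordered_embeddings` we have been using), and the tokenizing is a plain `split` instead of spaCy.

```python
import math

# Hypothetical miniature color table: name -> [R, G, B]
toy_colors = {
    'red': [255, 0, 0],
    'blue': [0, 0, 255],
    'purple': [128, 0, 128],
    'yellow': [255, 255, 0],
}

def mean_vector(matrix):
    """Element-wise average of a list of equal-length vectors (needs at least one row)."""
    n = len(matrix)
    return [sum(column) / n for column in zip(*matrix)]

def closest_name(vec, table):
    """Name of the table entry with the smallest Euclidean distance to vec."""
    return min(table, key=lambda name: math.dist(vec, table[name]))

text = 'the red door and the blue sky'

# Keep only the tokens that are color names, collecting their RGB vectors
matrix = [toy_colors[word] for word in text.split() if word in toy_colors]

avg = mean_vector(matrix)
print(avg, closest_name(avg, toy_colors))
```

One red plus one blue averages out to something sitting right next to purple, which is the same effect we saw (at a much bigger scale) with the brown mush of Dracula.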

Let's take the next big jump and look at word meaning captured by word-embeddings.

#Start here Wednesday

#6. Distributional semantics

In the previous section, the examples are interesting because of a simple fact: colors that we think of as similar are "closer" to each other in RGB vector space. In our color vector space, or in our animal cuteness/size space, you can think of the words identified by vectors close to each other as being *synonyms*, in a sense: they sort of "mean" the same thing.  Think of this in terms of writing, say, a search engine. If someone searches for "mauve trousers," then it's probably also okay to show them results for, say,

In [None]:
top10 = ordered_embeddings(color_table.loc['mauve'].tolist(), color_table)[:10]
for d, name in top10:  #using unpacking to get 2 separate assignments
    print(name + " trousers")


mauve trousers
dusty rose trousers
dusky rose trousers
brownish pink trousers
old pink trousers
reddish grey trousers
dirty pink trousers
old rose trousers
light plum trousers
ugly pink trousers


That's all well and good for color words, which intuitively seem to exist in a multidimensional continuum of perception, and for our animal space, where we've written out the vectors ahead of time. But what about arbitrary words? Is it possible to create a vector space for all English words that has this same "closer in space is closer in meaning" property?

To answer that, we have to back up a bit and ask the question: what does *meaning* mean? No one really knows, but one theory popular among computational linguists, computer scientists and other people who make search engines is the [Distributional Hypothesis](https://en.wikipedia.org/wiki/Distributional_semantics), which states that:

    Linguistic items with similar distributions have similar meanings.
    
What's meant by "similar distributions" is *similar contexts*. Take for example the following sentences:

    It was really cold yesterday.
    It will be really warm today, though.
    It'll be really hot tomorrow!
    Will it be really cool Tuesday?
    
According to the Distributional Hypothesis, the words `cold`, `warm`, `hot` and `cool` must be related in some way (i.e., be close in meaning) because they occur in a similar context, i.e., between the word "really" and a word indicating a particular day. (Likewise, the words `yesterday`, `today`, `tomorrow` and `Tuesday` must be related, since they occur in the context of a word indicating a temperature.)

In other words, according to the Distributional Hypothesis, a word's meaning is just a big list of all the contexts it occurs in. Two words are closer in meaning if they share contexts.
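
A quick sketch of the idea, under a big simplification: I am treating "context" as nothing more than the word immediately to the left. The regex tokenizer and the `followers` dictionary are my own choices for this illustration, not something from spaCy.

```python
import re
from collections import defaultdict

sentences = [
    'It was really cold yesterday.',
    'It will be really warm today, though.',
    "It'll be really hot tomorrow!",
    'Will it be really cool Tuesday?',
]

# Map each word to the list of words that directly follow it
followers = defaultdict(list)
for sentence in sentences:
    tokens = re.findall(r"[a-z']+", sentence.lower())
    for left, word in zip(tokens, tokens[1:]):
        followers[left].append(word)

# All the words that share the left context "really"
print(followers['really'])
```

The four temperature words fall out as a group simply because they share a neighbor, with no dictionary of meanings involved.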

#7. Word vectors by counting contexts

So how do we turn this insight from the Distributional Hypothesis into a system for creating general-purpose vectors that capture the meaning of words?  Let's use a small source text to begin with, such as this excerpt from Dickens:

    It was the best of times, it was the worst of times.

This spreadsheet tries to capture the context of words. 
![dickens contexts](http://static.decontextualize.com/snaps/best-of-times.png)

The spreadsheet has one column for every possible context, and one row for every word. The value in each cell is how many times the word occurs in the given context. The numbers across a word's row constitute that word's vector, i.e., the vector for the word `of` is

    [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
    
Because there are ten possible contexts, this is a ten-dimensional space. You could use the same distance formula that we defined earlier to get useful information about which vectors in this space are similar to each other. In particular, the vectors for `best` and `worst` are actually the same (a distance of zero), since they occur only in the same context (`the ___ of`):

    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
    
Of course, the conventional way of thinking about "best" and "worst" is that they're *antonyms*, not *synonyms*. But they're also clearly two words of the same kind, with related meanings (through opposition), a fact that is captured by this distributional model.
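
If you want to see the counting done in code, here is a minimal sketch. I am assuming a context is the pair (word before, word after), that the sentence is padded with made-up `<s>` and `</s>` boundary markers, and that contexts are numbered in order of first appearance. The function name `context_vectors` is hypothetical, and the spreadsheet may have been built differently.

```python
import re

def context_vectors(text):
    """Count (previous word, next word) contexts for every word in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    padded = ['<s>'] + tokens + ['</s>']  # boundary markers are an assumption
    contexts = []  # unique contexts, in order of first appearance
    counts = {}    # word -> {context: count}
    for i in range(1, len(padded) - 1):
        ctx = (padded[i - 1], padded[i + 1])
        if ctx not in contexts:
            contexts.append(ctx)
        word_counts = counts.setdefault(padded[i], {})
        word_counts[ctx] = word_counts.get(ctx, 0) + 1
    # One vector per word: a count for every context column
    vectors = {w: [c.get(ctx, 0) for ctx in contexts] for w, c in counts.items()}
    return contexts, vectors

contexts, vectors = context_vectors('It was the best of times, it was the worst of times.')
print(len(contexts))
print(vectors['of'])
print(vectors['best'] == vectors['worst'])
```

With those assumptions we get ten contexts, and `best` and `worst` end up with identical vectors because both only ever appear in `the ___ of`.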

### Contexts and dimensionality

In a corpus (collection of text) of any reasonable size, there will be many thousands if not many millions of possible contexts. It turns out, though, that many of the dimensions end up being superfluous and can either be eliminated or combined with other dimensions without significantly affecting the predictive power of the resulting vectors. The process of getting rid of superfluous dimensions in a vector space is called [dimensionality reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction), and most implementations of count-based word-vectors make use of dimensionality reduction so that the resulting vector space has a reasonable number of dimensions (say, 100 to 300, depending on the corpus and application).

The question of how to identify a "context" is itself very difficult to answer. In the toy example above, we've said that a "context" is just the word that precedes and the word that follows. Depending on your implementation of this procedure, though, you might want a context with a bigger "window" (e.g., two words before and after), or a non-contiguous window (skip a word before and after the given word). You might look at larger syntactic structure: what are the syntactic-contexts you find the word in? You might exclude certain "function" words like "the" and "of" when determining a word's context, or you might [lemmatize](https://en.wikipedia.org/wiki/Lemmatisation) the words before you begin your analysis, so two occurrences with different "forms" of the same word count as the same context. These are all questions open to research and debate, and different implementations of procedures for creating count-based word vectors make different decisions on this issue. In chapter 5, we eliminated stop words but we did not go as far as lemmatizing.

### GloVe vectors

But you don't have to create your own word vectors from scratch! Many researchers have made downloadable databases of pre-trained vectors. One such project is Stanford's [Global Vectors for Word Representation (GloVe)](https://nlp.stanford.edu/projects/glove/). These 300-dimensional vectors are included with spaCy, and they're the vectors we'll be using for the rest of this chapter. In fact, you already have them. They come with `en_core_web_md`. Nice.

Check this out.

In [None]:
nlp.vocab.has_vector('frankenstein')  #check to make sure word vectors have been loaded

True

In [None]:
dogv = nlp.vocab['dog'].vector  #get the 300d vector for dog

In [None]:
type(dogv)

cupy.core.core.ndarray

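The vector comes back as an array rather than a plain Python list (here a CuPy array, because this notebook is running spaCy on a GPU; on a CPU you would see a NumPy array), so let's just turn it into a Python list.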

In [None]:
dog_list = dogv.tolist()

In [None]:
len(dog_list)  #word vectors in this spacy model are length 300

300

In [None]:
dog_list[:10]

[-0.4017600119113922,
 0.37057000398635864,
 0.02128100022673607,
 -0.3412500023841858,
 0.04953800141811371,
 0.29440000653266907,
 -0.17375999689102173,
 -0.2798199951648712,
 0.06762199848890305,
 2.169300079345703]

For the sake of convenience, the following function gets the vector of a given string from spaCy's vocabulary:

In [None]:
def get_vec(s:str) -> list:
    return nlp.vocab[s].vector.tolist()

In [None]:
get_vec('dog') == dog_list  #should be the same

True

We even have a vector for words not in the vocab. It is all zeroes.

In [None]:
zero_vec = get_vec('askfsda')  #not in vocab
zero_vec.count(0)  #300 zeroes, i.e., all zeroes.

300

The following cell shows that the cosine similarity between `dog` and `puppy` is larger than the similarity between `trousers` and `octopus`.

In [None]:
up.cosine_similarity(get_vec('dog'), get_vec('puppy')) > up.cosine_similarity(get_vec('trousers'), get_vec('octopus'))

True
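
In case you are wondering what a cosine similarity function does under the hood, here is a plain-Python sketch. The name `cosine_sim` is mine, and I am not claiming this is how `up.cosine_similarity` is written. In particular, returning 0.0 when one vector is all zeroes is just a convention I picked so that out-of-vocab words do not cause a divide-by-zero.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

print(cosine_sim([1, 0], [0, 1]))        # vectors at right angles
print(cosine_sim([1, 2, 3], [2, 4, 6]))  # same direction, different length
```

Notice that length does not matter, only direction. That is handy for word vectors, where we care about which way a vector points more than how long it is.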

#8. Sentence similarity

I am going to switch gears a bit, and move us closer to doing prediction. What I will be interested in is converting an entire sentence into a single GloVe vector. Here is the general idea. I'll go through each row and grab the text of the sentence. I'll build a list of guarded tokens, kind of like e_list for Naive Bayes.

I'll then get the vectors for all the tokens and build a matrix. I'll then take the average. I will use that vector average as the representation of the sentence.

And as always, when I say "I", I mean you :)



##Let's work on our example sentences

In [None]:
pilot_sentences = [
  'It was really cold yesterday.',
  'It will be really warm today, though.',
  "It'll be really hot tomorrow!'",
  'Will it be really cool Tuesday?'
]

#Assignment 4
<img src='https://www.dropbox.com/s/3uyvp722kp5to2r/assignment.png?raw=1' width='300'>

Get the average vector for the first (0th) sentence in pilot_sentences. Store it in `s0_vec`. Should be of length 300.

In [None]:
#your code



In [None]:
len(s0_vec)  #300

300

Check against my results

In [None]:
print(s0_vec[:10])  #[0.20979999750852585, 0.25439999997615814, 0.13877900503575802, -0.01888199895620346, -0.16211500018835068, -0.1315389983355999, -0.14700499922037125, -0.09826499223709106, -0.0666164979338646, 2.3787500858306885]

[0.20979999750852585, 0.25439999997615814, 0.13877900503575802, -0.01888199895620346, -0.16211500018835068, -0.1315389983355999, -0.14700499922037125, -0.09826499223709106, -0.0666164979338646, 2.3787500858306885]


#Assignment 5
<img src='https://www.dropbox.com/s/3uyvp722kp5to2r/assignment.png?raw=1' width='300'>

I want to get the average for all the sentences in pilot_sentences. Instead of tediously copying and pasting code, please write a function `sent2vec` that takes as parameter a sentence (raw string) and produces the average glove vector. So you are packaging up the steps above.

However, there is one twist. If you run across a sentence that adds nothing to the matrix, i.e., it has no legal tokens, then just return this value:
<pre>
[0.0]*300  #produces a list of 300 zeroes.
</pre>
That will build a 300 element list that is all 0.0.
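
To show the shape of such a function without giving the game away, here is a toy-sized sketch. It uses a hypothetical three-dimensional lookup table (`toy_vectors`) instead of spaCy's 300-dimensional vectors, a regex instead of spaCy tokens, and it treats "legal token" as "word found in the table". All of those are my assumptions for the sketch, so your real `sent2vec` will look different inside.

```python
import re

DIMS = 3  # the real vectors have 300 dimensions

# Hypothetical word vectors, made up for this example
toy_vectors = {
    'cold': [1.0, 0.0, 0.0],
    'warm': [0.0, 1.0, 0.0],
    'really': [0.0, 0.0, 1.0],
}

def toy_sent2vec(sentence, table=toy_vectors, dims=DIMS):
    """Average the vectors of the known words in sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    matrix = [table[t] for t in tokens if t in table]
    if not matrix:
        return [0.0] * dims  # nothing usable in the sentence
    return [sum(column) / len(matrix) for column in zip(*matrix)]

print(toy_sent2vec('It was really cold yesterday.'))
print(toy_sent2vec('\n \n'))
```

The guard on the empty matrix is the important part: without it, a sentence with no usable tokens would try to divide by zero.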

In [None]:
#your code


Test it on first sentence and see if we match what we got by hand.

In [None]:
s0_vec == sent2vec(pilot_sentences[0])  #True

True

Check on a weird sentence.

In [None]:
sent2vec('\n \n')[:10]  #[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Let's do a cross-check on how similar the pilot sentences are.

Here they are again for reference.

<pre>
0 'It was really cold yesterday.',
1 'It will be really warm today, though.',
2 "It'll be really hot tomorrow!'",
3 'Will it be really cool Tuesday?'
</pre>

My results:
<pre>
0 1 0.7743651246870916
0 2 0.724727875606952
0 3 0.6154703833714615
1 2 0.7275724681645338
1 3 0.6179968179574922
2 3 0.6912142233533577
</pre>

In [None]:
for i in range(0, len(pilot_sentences)-1):
  for j in range(i+1, len(pilot_sentences)):
    av1 = sent2vec(pilot_sentences[i])
    av2 = sent2vec(pilot_sentences[j])
    sim = up.cosine_similarity(av1, av2)
    print(i, j, sim)

0 1 0.7743651246870916
0 2 0.724727875606952
0 3 0.6154703833714615
1 2 0.7275724681645338
1 3 0.6179968179574922
2 3 0.6912142233533577


They are relatively close.

Let's add a random sentence and try again.

In [None]:
pilot_sentences.append('It was the best of times and it was the worst of times.')

In [None]:
for i in range(0, len(pilot_sentences)-1):
  for j in range(i+1, len(pilot_sentences)):
    av1 = sent2vec(pilot_sentences[i])
    av2 = sent2vec(pilot_sentences[j])
    sim = up.cosine_similarity(av1, av2)
    print(i, j, sim)

0 1 0.7743651246870916
0 2 0.724727875606952
0 3 0.6154703833714615
0 4 0.5446654613600631
1 2 0.7275724681645338
1 3 0.6179968179574922
1 4 0.529159756265171
2 3 0.6912142233533577
2 4 0.4915770408238424
3 4 0.42069515226065557


The new sentence is less similar to each of the original four than they are to each other, so that makes logical sense.

#9. Finding sentence matches in a document

I'll create a test sentence and then look for matching sentences in Dracula. You could do the same for a document or a journal article.

First, glove-ify all the Dracula sentences and place them in a matrix.

Warning: this took me about 5 minutes. But once I have it, I can quickly try different sentences.

Also note that this matrix could serve as a KNN matrix. What it is missing is a label. I have no labels on the Dracula sentences. So we are just exploring our data at this point.

In [None]:
#drac_matrix = [sent2vec(drac_sentences[i].text) for i in range(len(drac_sentences))]

drac_matrix = []

for i in range(len(drac_sentences)):  #we defined drac_sentences above
  sentence = drac_sentences[i]
  vec = sent2vec(sentence.text)
  drac_matrix.append(vec)

In [None]:
test_sentence = "My favorite food is strawberry ice cream."

Ok, find sentences in Dracula that are closest to this using sent2vec.


In [None]:
input_vec = sent2vec(test_sentence)

In [None]:
import numpy as np

ordered_distances = []

for i in range(len(drac_matrix)):  #row i of drac_matrix goes with drac_sentences[i]
  vec = drac_matrix[i]
  d = up.fast_cosine(np.array(input_vec), np.array(vec))  #cosine similarity (bigger is closer), speedier version that relies on numpy
  ordered_distances.append([d, i])


In [None]:

for d,j in sorted(ordered_distances, reverse=True)[:10]:
  print(drac_sentences[j])
  print('=========')


we get hot soup, or coffee, or tea; and
off we go.
i had for breakfast more paprika, and a sort of porridge of maize flour
which they said was "mamaliga," and egg-plant stuffed with forcemeat, a
very excellent dish, which they call "impletata."
this, with some cheese
and a salad and a bottle of old tokay, of which i had two glasses, was
my supper.
would none of you like a cup of tea?
a chicken done up some way with red pepper, which was
very good but thirsty.
there was everywhere a bewildering mass of fruit blossom--apple,
plum, pear, cherry; and as we drove by i could see the green grass under

i dined on what they
called "robber steak"--bits of bacon, onion, and beef, seasoned with red
pepper, and strung on sticks and roasted over the fire, in the simple
style of the london cat's meat!
; for there be folk that do think a balm-bowl be
like the sea, if only it be their own.
if he can't get food
come, and we'll have a cup of tea somewhere.


##Pretty dang impressive

The book does not contain the main topic of our test sentence, "strawberry ice cream". However, sentences that have to do with food and drink were found. Note that none of these sentences contains "strawberry ice cream" and only one contains "food".
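
The search itself is just "score every row, sort, take the top k". Here is a self-contained sketch with two-dimensional toy vectors. The names `cosine_sim` and `top_matches` are hypothetical helpers written for this example, not functions from our library.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity, with 0.0 for all-zero vectors (a convention chosen here)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_matches(query_vec, matrix, k=3):
    """Return the k (similarity, row index) pairs with the highest similarity."""
    scored = [(cosine_sim(query_vec, row), i) for i, row in enumerate(matrix)]
    return sorted(scored, reverse=True)[:k]

# Toy "sentence vectors"; the last row mimics a sentence with no usable tokens
toy_matrix = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.0, 0.0],
]

best = top_matches([1.0, 0.0], toy_matrix, k=2)
print(best)
```

Keeping the row index next to each score is what lets us go back and print the matching sentence afterwards.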

##How about this :)

In [None]:
test_sentence = "The blood bank is looking for donors."

In [None]:
input_vec = sent2vec(test_sentence)

In [None]:
import numpy as np

ordered_distances = []

for i in range(len(drac_matrix)):  #row i of drac_matrix goes with drac_sentences[i]
  vec = drac_matrix[i]
  d = up.fast_cosine(np.array(input_vec), np.array(vec))  #cosine similarity (bigger is closer), speedier version that relies on numpy
  ordered_distances.append([d, i])


In [None]:

for d,j in sorted(ordered_distances, reverse=True)[:5]:
  print(drac_sentences[j])
  print('=========')


whereupon the captain tell him that he
had better be quick--with blood--for that his ship will leave the
place--of blood--before the turn of the tide--with blood.
we must have
another transfusion of blood, and that soon, or that poor girl's life
won't be worth an hour's purchase.
he had been paid for his work by
an english bank note, which had been duly cashed for gold at the danube
international bank.
i don't care for the pale people; i like them with lots of blood
in them, and hers had all seemed to have run out.
"do you mean to tell me, friend john, that you have no suspicion as to
what poor lucy died of; not after all the hints given, not only by
events, but by me?"

"of nervous prostration following on great loss or waste of blood."

"and how the blood lost or waste?


Not as impressive. We might get the same results by just searching for all sentences that contain "blood". One exception is the sentence starting with "do you mean to tell me, friend john,". I might argue that this does involve blood in a very subtle way.



#10. Bias creeps in

You can kind of expect it, right? If we build word-vectors from today's web content, and that web content is biased, then we will end up with bias creeping into word-vectors.

One of the magical outcomes of word-vectors is that we can do math on them in a similar fashion to our animal and color examples.

Here is a diagram of an example that was widely reported:

<img src='https://www.dropbox.com/s/0norjklo12ebemj/Screenshot%202020-05-08%2011.21.29.png?raw=1'>

In essence, find the difference between man and woman (shown as a gender distance). Then take king and subtract the gender distance to get queen. I know you might wonder if gender and royalty are really features in a word-vector, kind of like size and cuteness were for animals. Yes, but they are spread through the entire vector. So we cannot point to `vector[i]` and say that is the gender determiner.


In [None]:
woman_vec = get_vec('woman')
man_vec = get_vec('man')
king_vec = get_vec('king')
queen_vec = get_vec('queen')

In [None]:
gender_dist = subtractv(man_vec, woman_vec)
X = subtractv(king_vec, gender_dist)

In [None]:
up.cosine_similarity(X, queen_vec)

0.7880844327411434

Fairly high as we would expect. BTW: this is pretty dang impressive if you ask me.
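
To see why the arithmetic works at all, it helps to look at an idealized toy space. Below I hand-build two-dimensional vectors where the first slot is "royalty" and the second is "gender". These numbers are completely made up, and `subtract` is a stand-in I wrote for this sketch, not our `subtractv`.

```python
def subtract(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]

# Hypothetical vectors: [royalty, gender]
toy = {
    'man': [0.0, 1.0],
    'woman': [0.0, -1.0],
    'king': [1.0, 1.0],
    'queen': [1.0, -1.0],
}

gender = subtract(toy['man'], toy['woman'])  # the gender direction
x = subtract(toy['king'], gender)            # king - (man - woman)

print(x, x == toy['queen'])
```

In this clean toy space we land exactly on queen. In the real 300-dimensional vectors the features are smeared across all the dimensions and mixed with everything else the words have in common, so we only get close (0.79) instead of a perfect match.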


So what's the bias? Check another early example out.

<img src='https://miro.medium.com/max/1400/1*DZa3CnBeyjyCrwy1wMFGNg.png'>

In essence, if you asked for an ordering of the vectors closest to doctor, you would see "man" in that top 10 list but not "woman". The reverse for nurse.

As I said, the word-vectors are trained on web content. They will pick up whatever bias exists in that content.

The word-embedding gurus took so much heat from these early examples that they tried to de-bias the vectors. Here is one article that gives a high-level overview: https://medium.com/@dhartidhami/bias-in-word-embeddings-4ce8e4261c7

Let's see how they are doing.

In [None]:
doctor_vec = get_vec('doctor')
nurse_vec = get_vec('nurse')

In [None]:
X = subtractv(doctor_vec, gender_dist)

In [None]:
up.cosine_similarity(X, nurse_vec)

0.7022648245931439

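A similarity of about 0.70 says that shifting doctor along the gender direction lands fairly close to nurse. Still kind of a problem.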

Let's try one more thing. This uses a similarity method built into spaCy.

In [None]:
x = nlp("man")
y = nlp("nurse")
x.similarity(y)

array(0.28819457, dtype=float32)

In [None]:
x = nlp("woman")
y = nlp("nurse")
x.similarity(y)

array(0.49859723, dtype=float32)

I think they still have a ways to go.

#Stopping here for now

In coming weeks, I want to look at using something like `sent2vec` in a prediction model. I hope you can see we could use it with something like KNN. Take each tweet, use `sent2vec` to produce a row of 300 values, store all the rows in our crowd table. Given a new tweet, convert it using `sent2vec`, then find the k closest.
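
As a preview, here is a bare-bones sketch of that KNN idea. The two-dimensional rows stand in for 300-value `sent2vec` rows, the labels are invented, and `knn_predict` is a hypothetical helper. I am using Euclidean distance here as an assumption. Cosine similarity would slot in just as easily.

```python
import math
from collections import Counter

def knn_predict(query_vec, rows, labels, k=3):
    """Majority label among the k rows closest to query_vec."""
    # Sort row indices by distance from the query, nearest first
    order = sorted(range(len(rows)), key=lambda i: math.dist(query_vec, rows[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy crowd table: each row would really be a sent2vec vector for one tweet
rows = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.85, 0.3]]
labels = ['pos', 'pos', 'neg', 'neg', 'pos']

print(knn_predict([0.7, 0.2], rows, labels, k=3))
print(knn_predict([0.1, 0.8], rows, labels, k=3))
```

The only thing our Dracula matrix was missing to do this for real is the `labels` list.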

But I want to explore using word-vectors in a new algorithm, Artificial Neural Nets (ANNs). Stay tuned.

#End notes

For you Linguistics fans out there, here is a paper that describes the impact word-embedding is having on the field of Linguistics. You may want to circle back to this after you have had some practice with it.
https://www.semanticscholar.org/paper/Distributional-Semantics-and-Linguistic-Theory-Boleda/510928a367d51d9ee294dd8160cc0bd66f796c60

For you Digital Humanities fans, here is a paper that discusses the use of word-embeddings in 19th century literature. The interesting parts for me are (a) it shows why you might want to build your own word-vectors (moderately easyish) to fit text from a specific period or domain (e.g., 19th century, Medicine), and (b) why historians might want to leave biased language alone (i.e., not try to remove bias) because they want to study its evolution. Again, you might circle back to this at the end of the quarter.
http://ryanheuser.org/word-vectors/