<a href="https://colab.research.google.com/github/saralieber/CS_Studio/blob/master/Review_Ch7_Text_to_Vectors_And_Averages.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Embeddings

An embedding takes a complex structure and reduces it to a vector encoding. The encoding should be compact and low-dimensional, and it should capture "features" of the structure so that similar structures get similar vectors.

We will look at how to "embed" text into a vector.

## Example

Find similarities among these words and the creatures they designate. First, look for similarities between `kitten` and `hamster`.

![Animal spreadsheet](http://static.decontextualize.com/snaps/animal-spreadsheet.png)

We will come up with two features that we think are important for animals - cuteness and size.

Each word is associated with a 2d vector (cuteness, size).


In [2]:
# Build the pandas version of the table above
## Use a pandas method that converts a list of row values into a table

## First, create a list of lists representing the rows in the above table 
## [['animal_name', cuteness, size]]
rows = [
    ['kitten', 95,15],
    ['hamster', 80,8],
    ['tarantula', 8,3],
    ['puppy', 90, 20],
    ['crocodile', 5, 40],
    ['dolphin', 60,45],
    ['panda bear', 75, 40],
    ['lobster', 2, 15],
    ['capybara', 70, 30],
    ['elephant', 65, 90],
    ['mosquito', 1, 1],
    ['goldfish', 25, 2],
    ['horse', 50, 50],
    ['chicken', 25, 15]
]


# Then, create the column names
columns = ['animal', 'cuteness', 'size']


# Then, use the pd.DataFrame.from_records() function to convert the above into a table
## pd.DataFrame.from_records(data_with_rows, columns = data_with_column_names)
import pandas as pd
animal_table = pd.DataFrame.from_records(rows, columns=columns)
animal_table = animal_table.set_index(['animal']) # make animal names the index
animal_table

| animal | cuteness | size |
|---|---|---|
| kitten | 95 | 15 |
| hamster | 80 | 8 |
| tarantula | 8 | 3 |
| puppy | 90 | 20 |
| crocodile | 5 | 40 |
| dolphin | 60 | 45 |
| panda bear | 75 | 40 |
| lobster | 2 | 15 |
| capybara | 70 | 30 |
| elephant | 65 | 90 |


You can begin to ask questions like... 

*Which animal is most similar to a capybara?*

One approach is to treat the capybara as the reference point and compare every other row to it using Euclidean distance or cosine similarity.

Visualize the animals in 2-dimensional space based on their cuteness and size scores:

![Animal space](http://static.decontextualize.com/snaps/animal-space.png)

In [3]:
# Flush the old uo_puddles directory and re-import
!rm -r 'uo_puddles'
my_github_name = 'uo-puddles' # can replace with your account name
clone_url = f'https://github.com/{my_github_name}/uo_puddles.git'
!git clone $clone_url
import uo_puddles.uo_puddles as up

rm: cannot remove 'uo_puddles': No such file or directory
Cloning into 'uo_puddles'...
remote: Enumerating objects: 234, done.
remote: Counting objects: 100% (234/234), done.
remote: Compressing objects: 100% (198/198), done.
remote: Total 234 (delta 139), reused 64 (delta 33), pack-reused 0
Receiving objects: 100% (234/234), 59.83 KiB | 7.48 MiB/s, done.
Resolving deltas: 100% (139/139), done.


In [4]:
# Based on the graph, panda bears look most similar to capybaras
# Calculate the Euclidean distance between a capybara and a panda bear using the euclidean_distance function from the uo_puddles library

up.euclidean_distance([70,30], [75,40])

11.180339887498949
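
The `euclidean_distance` function comes from the `uo_puddles` library, whose source is not shown here. As a rough sketch of what such a function computes (an assumption about its behavior, not its actual implementation), the straight-line distance between two equal-length vectors can be written in plain Python. The name `euclidean_distance_sketch` is made up for this example.

```python
import math

def euclidean_distance_sketch(v1, v2):
    # Hypothetical stand-in for up.euclidean_distance:
    # square root of the sum of squared component differences
    assert len(v1) == len(v2), "vectors must be the same length"
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

capybara = [70, 30]
panda_bear = [75, 40]

# sqrt(5**2 + 10**2) = sqrt(125), about 11.18
print(euclidean_distance_sketch(capybara, panda_bear))
```

The result agrees with the library call above, which suggests (but does not prove) that the library uses the same formula.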

In [5]:
# Based on the graph, tarantula and elephant look far away from each other
# Express this difference as a number

up.euclidean_distance([8,3], [65,90])

104.0096149401583

In [0]:
# You can find which animal is closest to any chosen point in space.
## Write a function below for doing so

def ordered_embeddings(target_vector, table):
  # target_vector must be a list of numbers (convert it to a list first if it is not)
  names = table.index.tolist() # row labels of the table (here, the animal names)
  ordered_list = [] # will hold one [distance, name] pair per row
  for name in names: # for each row label
    row = table.loc[name].tolist() # that row's values as a list
    d = up.euclidean_distance(target_vector, row) # distance from the target vector to this row
    ordered_list.append([d, name]) # record the distance together with the row's name
  ordered_list = sorted(ordered_list) # sort from smallest to largest distance

  return ordered_list

In [7]:
# Use puppy as the target_vector and calculate distances

pup = animal_table.loc['puppy'].tolist()
ordered_embeddings(pup,animal_table)

# puppy is most similar to a puppy (d = 0), and then a kitten (d = 7.07)

[[0.0, 'puppy'],
 [7.0710678118654755, 'kitten'],
 [15.620499351813308, 'hamster'],
 [22.360679774997898, 'capybara'],
 [25.0, 'panda bear'],
 [39.05124837953327, 'dolphin'],
 [50.0, 'horse'],
 [65.19202405202648, 'chicken'],
 [67.446274915669, 'goldfish'],
 [74.33034373659252, 'elephant'],
 [83.74365647617735, 'tarantula'],
 [87.32124598286491, 'crocodile'],
 [88.14193099768123, 'lobster'],
 [91.00549433962765, 'mosquito']]

In [12]:
# Calculate what is halfway between a chicken and an elephant

elephant = animal_table.loc['elephant'].tolist()
chicken = animal_table.loc['chicken'].tolist()

zipped = zip(elephant, chicken) # zip() returns an iterator of tuples: the first items of each input are paired, then the second items, and so on
list(zipped) # [(elephant_cuteness, chicken_cuteness), (elephant_size, chicken_size)]

zip

In [0]:
half_vector = [(e+c)/2 for e,c in zip(elephant,chicken)]
half_vector

In [13]:
# Now, use the ordered_embeddings function to calculate the distance between the half_vector and each animal in the animal_table

ordered_embeddings(half_vector, animal_table) # The horse is closest to the average of a chicken & an elephant

[[5.5901699437494745, 'horse'],
 [16.77050983124842, 'dolphin'],
 [32.5, 'panda bear'],
 [33.63406011768428, 'capybara'],
 [41.907636535600524, 'crocodile'],
 [42.5, 'chicken'],
 [42.5, 'elephant'],
 [54.31620384378864, 'goldfish'],
 [55.509008277936296, 'puppy'],
 [56.61492736019362, 'hamster'],
 [57.05479822065801, 'lobster'],
 [61.80008090609591, 'tarantula'],
 [62.5, 'kitten'],
 [67.73662229547618, 'mosquito']]

In [16]:
# Calculate the distance between a tarantula and a hamster
tarantula = animal_table.loc['tarantula'].tolist()
hamster = animal_table.loc['hamster'].tolist()
tar_ham_diff = up.euclidean_distance(tarantula, hamster)
tar_ham_diff

72.17340230306452

In [0]:
# Tarantulas are to hamsters as chickens are to ____?
## That is, which animal is about as far from a chicken as a hamster is from a tarantula?

chicken = animal_table.loc['chicken'].tolist()
chick_distances = ordered_embeddings(chicken, animal_table) # Calculate the distance from a chicken to every animal in the table

In [19]:
# Calculate which distance from above table is closest to the tar_ham_diff (72)

## We want the minimum value when subtracting 72.17 from each distance in the chick_distances table
min(chick_distances, key=lambda x:abs(x[0]-tar_ham_diff)) 

# The kitten, which is a euclidean distance of 70 units from a chicken, is to a chicken what a hamster is to a tarantula

[70.0, 'kitten']
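
Matching on distance alone ignores direction: any animal roughly 72 units from a chicken would qualify, whichever way it lies. A common alternative, used with the colors later in this notebook, is to add the offset (hamster - tarantula) to the chicken vector and look for the animal nearest to the resulting point. The sketch below does that with the values from the table above. The helper name `nearest` is made up for illustration.

```python
import math

animals = {
    'kitten': [95, 15], 'hamster': [80, 8], 'tarantula': [8, 3],
    'puppy': [90, 20], 'crocodile': [5, 40], 'dolphin': [60, 45],
    'panda bear': [75, 40], 'lobster': [2, 15], 'capybara': [70, 30],
    'elephant': [65, 90], 'mosquito': [1, 1], 'goldfish': [25, 2],
    'horse': [50, 50], 'chicken': [25, 15],
}

def distance(v1, v2):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def nearest(point, table, exclude=()):
    # Hypothetical helper: name of the row closest to point, skipping excluded names
    candidates = [name for name in table if name not in exclude]
    return min(candidates, key=lambda name: distance(point, table[name]))

# Offset that takes a tarantula to a hamster, applied to a chicken
offset = [h - t for h, t in zip(animals['hamster'], animals['tarantula'])]
target = [c + o for c, o in zip(animals['chicken'], offset)]

print(target)                                          # [97, 20]
print(nearest(target, animals, exclude=('chicken',)))  # kitten
```

With these numbers the offset approach also lands on the kitten, so the two methods agree here. They need not agree in general.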

## Colors

Colors are words that can be represented as vectors with three dimensions: red, green, and blue. By representing colors in 3d space, we can ask questions like...

*Which colors are similar? Given the name of two colors, what's the name of the average of the two?*

In [21]:
# This url points to a CSV file of color names with their hex codes and RGB values

url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vT7WNqkgfIYL5AWgb8aGSGhvh3wo-JwTQlJzN1Y2LYH09fzLtfeKHMDau9s6PcOBwU01-DfbPuEzhTZ/pub?output=csv'

color_table = pd.read_csv(url, encoding='utf-8', dtype={'color':str}, na_filter=False) # use pd.read_csv to load the CSV into a table
color_table = color_table.set_index(['color']) # set the color names as the index
color_table.head() # look at top of table

| color | hex | red | green | blue |
|---|---|---|---|---|
| acid green | #8ffe09 | 143 | 254 | 9 |
| adobe | #bd6c48 | 189 | 108 | 72 |
| algae | #54ac68 | 84 | 172 | 104 |
| algae green | #21c36f | 33 | 195 | 111 |
| almost black | #070d0d | 7 | 13 | 13 |


In [0]:
# Write a function that converts colors from hex format (#1a2b3c) to a tuple of integers

def hex_to_int(s:str) -> tuple: # define a function, called hex_to_int, that takes a hex color string s and returns a tuple of three integers
  assert isinstance(s, str), f's must be a string but is instead a {type(s)}' # assert messages report an error if the input isn't in the correct format
  assert len(s) == 7, f's must be 7 characters long but is instead {len(s)}'
  assert s[0] == '#', f's must start with a # but instead starts with an {s[0]}'

  s = s.lstrip("#") # strip off the # from the left-hand side of the hex code
  red = int(s[:2], 16) # int(text, 16) parses text as a base-16 (hexadecimal) number and returns an integer
                       # [:2] takes the first two hex digits (e.g., 8f in #8ffe09), which encode red
  green = int(s[2:4], 16) # [2:4] takes the third and fourth hex digits (e.g., fe)
  blue = int(s[4:6], 16) # [4:6] takes the fifth and sixth hex digits (e.g., 09)
  return (red, green, blue) # return the RGB integers as a tuple
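
As a quick check of the conversion, a compact variant can be run on the hex code from the first row of the color table; it should reproduce that row's red, green, and blue columns. The name `hex_to_rgb` is hypothetical and not part of the notebook.

```python
def hex_to_rgb(s: str) -> tuple:
    # Hypothetical compact variant: parse each pair of hex digits as a base-16 integer
    assert isinstance(s, str) and len(s) == 7 and s[0] == '#', 'expected a string like #1a2b3c'
    return tuple(int(s[i:i + 2], 16) for i in (1, 3, 5))

print(hex_to_rgb('#8ffe09'))  # (143, 254, 9), matching the acid green row
```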
  

In [23]:
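# Calculate cosine similarities
## Reminder: cosine similarity ranges from -1 to 1 in general; for vectors with no negative components, such as RGB values, it ranges from 0 to 1 (1 means the vectors point in the same direction)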

# First, drop the hex column from the table since we're representing the colors as their integer RGB equivalents now
## the .drop method allows you to name columns you want to drop from a table; axis = 1 means drop by columns
color_table = color_table.drop(['hex'], axis = 1)
color_table.head()

| color | red | green | blue |
|---|---|---|---|
| acid green | 143 | 254 | 9 |
| adobe | 189 | 108 | 72 |
| algae | 84 | 172 | 104 |
| algae green | 33 | 195 | 111 |
| almost black | 7 | 13 | 13 |


In [24]:
# Calculate cosine similarity between olive and red

olive_vector = color_table.loc['olive'].tolist()
red_vector = color_table.loc['red'].tolist()
up.cosine_similarity(olive_vector, red_vector)

0.6823879113063314
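
`up.cosine_similarity` is also a library call whose source is not shown. A minimal sketch of the standard formula, the dot product divided by the product of the two vector lengths, is given below. The function name is hypothetical, and the olive value (110, 117, 14) is an assumption rather than something read from the table; red (229, 0, 0) appears in later outputs.

```python
import math

def cosine_similarity_sketch(v1, v2):
    # Hypothetical stand-in for up.cosine_similarity
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)  # raises ZeroDivisionError if either vector is all zeros

olive = [110, 117, 14]  # assumed RGB value for 'olive'
red = [229, 0, 0]

print(cosine_similarity_sketch(olive, red))           # about 0.68
print(cosine_similarity_sketch([100, 0, 0], red))     # 1.0: same direction, different length
```

Unlike Euclidean distance, cosine similarity ignores vector length, so a darker and a brighter red with the same channel proportions score 1.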

In [25]:
# Find the color closest to red
ordered_embeddings(red_vector, color_table)[:10] # just first 10 entries

# Fire engine red is closest to red

[[0.0, 'red'],
 [25.079872407968907, 'fire engine red'],
 [29.068883707497267, 'bright red'],
 [45.552167895721496, 'tomato red'],
 [45.73838650411709, 'cherry red'],
 [46.33573135281238, 'scarlet'],
 [53.563046963368315, 'vermillion'],
 [56.2672195865408, 'orangish red'],
 [56.49778756730214, 'cherry'],
 [59.84981202978001, 'lipstick red']]

In [0]:
# Write a function for subtracting one vector from another

def subtractv(x:list, y:list) -> list:
  assert isinstance(x, list), f"x must be a list but instead is {type(x)}"
  assert isinstance(y, list), f"y must be a list but instead is {type(y)}"
  assert len(x) == len(y), f"x and y must be the same length"

  result = [] # blank list to contain results of subtracting each item in x and y
  for i in range(len(x)):
    c1 = x[i]
    c2 = y[i]
    result.append(c1-c2)
  return result

In [28]:
# Practice using subtractv

v1 = [5,10,20]
v2 = [1,2,3]

# expected v1-v2 = [4,8,17]

subtractv(v1,v2)

[4, 8, 17]

In [0]:
# Write a function for adding two vectors

def addv(x:list, y:list) -> list: # define a function, called addv, that takes variables x & y (both lists) as inputs
  assert isinstance(x, list), f"x must be a list but instead is {type(x)}"
  assert isinstance(y, list), f"y must be a list but instead is {type(y)}"
  assert len(x) == len(y), f"x and y must be the same length"

  #result = [(c1 + c2) for c1, c2 in zip(x, y)]  #one-line compact version - called a list comprehension

  result = []
  for i in range(len(x)):
    c1 = x[i]
    c2 = y[i]
    result.append(c1+c2)

  return result

In [30]:
# Add v1 and v2 from above
# Expected v1+v2 = [6,12,23]

addv(v1,v2)

[6, 12, 23]

In [0]:
# Write a function for dividing a vector by a number

def dividev(x:list, y:int) -> list:
  assert isinstance(x, list), f"x must be a list but instead is {type(x)}"
  assert isinstance(y, int), f"y must be an integer but instead is {type(y)}"

  result = []
  for i in range(len(x)): 
    c1 = x[i]
    result.append(c1/y)
  return result 

In [36]:
# Divide v1 by c

v1 = [2,10,20]
c = 2

# Expected v1/c = [1,5,10]

dividev(v1,c)

[1.0, 5.0, 10.0]

In [0]:
# Write a function for calculating the mean vector from a matrix

def meanv(matrix:list) -> list:
  assert isinstance(matrix, list), f"matrix must be a list but instead is {type(matrix)}"
  assert len(matrix) >=1, f"matrix must have at least one row"

  sumv = matrix[0] # start with the first row
  for row in matrix[1:]: # add each row to the first row, starting with the second row
    sumv = addv(sumv, row) # take the sum of the first+second row, and then this resulting sum plus the third row
  mean = dividev(sumv, len(matrix)) # divide the sum of all the rows by the number of rows
  return mean

In [58]:
# Test the meanv function using matrix A

A = [[0,1], 
     [2,2], 
     [4,3]]

A[0] # [0,1]
A[1] # [2,2]
A[2] # [4,3]
len(A) # 3

# Expected result - note: the numbers in corresponding positions get added using the addv function
## First iteration:
# sumv = A[0] + A[1] = ([0,1]+[2,2]) = [2,3]
## Second iteration:
# sumv = [2,3] + A[2] = ([2,3]+[4,3]) = [6,6]
## mean = [6,6]/3 = [2.0,2.0]

meanv(A)

[2.0, 2.0]

In [61]:
# A fancier version of creating the meanv function above
## zip(*matrix) is a common idiom for transposing a matrix

transpose_A = list(zip(*A)) # transpose matrix A

# Expected result - columns become rows
# [[0, 2, 4], 
# [1, 2, 3]]

print(transpose_A) # [(0,2,4), (1,2,3)]


# Next, calculate the sum of each row
row_sums = [sum(row) for row in transpose_A]
print(row_sums) # [6,6]


# Next, divide the row sums by the # of rows in the original matrix
dividev(row_sums, len(A)) # [2.0, 2.0]

[(0, 2, 4), (1, 2, 3)]
[6, 6]


[2.0, 2.0]

In [63]:
# Another approach
## Using the reduce function

from functools import reduce

row_sums = reduce(addv, A) # fold addv over the rows of A: ((A[0] + A[1]) + A[2])
row_sums # [6,6]

dividev(row_sums, len(A)) # [2.0, 2.0]

[2.0, 2.0]

In [64]:
# Find the ten colors closest to the average of 'black' and 'white'

black = color_table.loc['black'].tolist()
white = color_table.loc['white'].tolist()

average_bw = [(b+w)/2 for b,w in zip(black,white)]
average_bw

[127.5, 127.5, 127.5]

In [66]:
ordered_embeddings(average_bw, color_table)[:10]

[[4.330127018922194, 'medium grey'],
 [18.567444627627143, 'purple grey'],
 [19.716744153130353, 'steel grey'],
 [21.511624764298954, 'battleship grey'],
 [22.46664193866097, 'grey purple'],
 [24.14021540914662, 'purplish grey'],
 [24.264171117101856, 'greyish purple'],
 [25.470571253900058, 'steel'],
 [26.12948526090784, 'warm grey'],
 [26.205915362757317, 'green grey']]

In [67]:
# Find the ten colors closest to what you get by subtracting "purple" from "red" (red minus purple)

red = color_table.loc['red'].tolist()
purple = color_table.loc['purple'].tolist()

subtract_rp = subtractv(red,purple)
subtract_rp

[103, -30, -156]

In [68]:
ordered_embeddings(subtract_rp, color_table)[:10]

[[160.63934760823702, 'blood'],
 [161.48374531202822, 'dark red'],
 [161.67250848551834, 'mahogany'],
 [162.46230331987786, 'dried blood'],
 [163.7192719260625, 'deep brown'],
 [167.2154299100415, 'deep red'],
 [167.58878244083044, 'reddy brown'],
 [168.12197952677099, 'blood red'],
 [168.6297719858507, 'indian red'],
 [169.7203582367183, 'chocolate brown']]

In [69]:
# Find the ten colors closest to blue plus green

blue = color_table.loc['blue'].tolist()
green = color_table.loc['green'].tolist()

add_bg = addv(blue,green)
add_bg

[24, 243, 249]

In [70]:
ordered_embeddings(add_bg, color_table)[:10]

[[14.212670403551895, 'bright turquoise'],
 [15.0996688705415, 'bright light blue'],
 [20.73644135332772, 'bright aqua'],
 [27.49545416973504, 'cyan'],
 [33.34666400106613, 'neon blue'],
 [38.3275357934736, 'aqua blue'],
 [42.49705872175156, 'bright cyan'],
 [45.05552130427524, 'bright sky blue'],
 [49.09175083453431, 'aqua'],
 [56.2672195865408, 'bright teal']]

In [71]:
# _____ is to blue as pink is to red?

# In other words...
## (pink - red) ~ (X - blue)
# or
## (pink - red) + blue ~ X

pink = color_table.loc['pink'].tolist()

pink_to_red = subtractv(pink,red) # pink minus red
pink_minus_red_plus_blue = addv(pink_to_red,blue) # pink minus red plus blue

ordered_embeddings(pink_minus_red_plus_blue, color_table)[:10] # neon blue

[[163.29727493133498, 'neon blue'],
 [163.44418007380992, 'bright sky blue'],
 [170.0764533967004, 'bright light blue'],
 [172.97976760303501, 'cyan'],
 [174.54512310574592, 'bright cyan'],
 [176.39727889057698, 'bright turquoise'],
 [178.23860412379804, 'clear blue'],
 [178.54131174604942, 'azure'],
 [178.92456511055155, 'dodger blue'],
 [180.95303258028034, 'lightish blue']]
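
The distances in this list are much larger than in the earlier color queries. One likely reason is that vector arithmetic can leave the valid 0 to 255 range, so the target point is not a displayable color. The sketch below reproduces the arithmetic with RGB values that are assumed here (they are consistent with the outputs above, but are not read from the table) and shows one optional remedy, clamping each channel. `clamp_rgb` is a hypothetical helper, not part of the notebook.

```python
pink = [255, 129, 192]  # assumed value
red = [229, 0, 0]       # assumed value
blue = [3, 67, 223]     # assumed value

# (pink - red) + blue, component by component
target = [p - r + b for p, r, b in zip(pink, red, blue)]
print(target)  # [29, 196, 415]: the blue channel exceeds 255

def clamp_rgb(v):
    # Hypothetical helper: force every channel into the 0..255 range
    return [max(0, min(255, c)) for c in v]

print(clamp_rgb(target))  # [29, 196, 255]
```

Passing the clamped vector to `ordered_embeddings` would be one way to search from a valid color; whether that gives a more meaningful answer is a judgment call.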

In [74]:
# Calculate the average of all the colors and find the ten colors closest to the average

matrix = []

for i in color_table.index.values.tolist():
  matrix.append(color_table.loc[i].tolist())
matrix[:5]

[[143, 254, 9], [189, 108, 72], [84, 172, 104], [33, 195, 111], [7, 13, 13]]

In [75]:
average_all_colors = meanv(matrix)
average_all_colors

[141.5690200210748, 134.3119072708114, 107.93888303477344]

In [76]:
ordered_embeddings(average_all_colors, color_table)[:10]

[[4.602553484191201, 'brown grey'],
 [21.139113217848376, 'reddish grey'],
 [21.428001832576594, 'brownish grey'],
 [24.206525784721993, 'medium grey'],
 [25.59966041635289, 'green grey'],
 [26.10522478271917, 'warm grey'],
 [27.96545249675546, 'dark khaki'],
 [30.709594237593613, 'grey green'],
 [32.567254561942164, 'grey/green'],
 [33.166059500486384, 'greeny grey']]

## Using vectorized language for academic ends

In [0]:
import spacy

spacy can use a Graphics Processing Unit (GPU) if one is available, but it does not require one.

You can turn on colab's GPU under Runtime/Change runtime type (but it will restart your kernel, so you'll have to run all cells above again).

The benefit of a GPU is that it *may* make your spacy code run faster.

In [79]:
# Ask spacy to use the GPU if one is available
spacy.prefer_gpu() # returns True if a GPU was activated, False if spacy will run on the CPU

False

In [0]:
!python -m spacy download en_core_web_md # download the dictionary
import en_core_web_md
nlp = en_core_web_md.load()

In [82]:
nlp.vocab.length # the dictionary contains about 1.3M words

1340241

In [84]:
'marvelous' in nlp.vocab # Check to see if a word is contained in the dictionary

True

## Let's check out the color of some classic books.

To calculate the average of an entire book, we'll follow these steps:

1. Parse the book into words using spacy's `nlp` pipeline.
2. Check every word to see if it names a color in our vector space, i.e., the color_table. If it does, add it to a list of vectors.
3. Find the average of that list of vectors.
4. Find the color(s) closest to that average vector.
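
The steps above can be sketched end to end on a toy example. This is only an illustration under stated assumptions: a three-color dictionary stands in for `color_table`, a naive whitespace split stands in for spacy's parser, and the sample sentence is made up.

```python
import math
import string

# Tiny stand-in for color_table (assumed values for illustration)
colors = {'red': [229, 0, 0], 'white': [255, 255, 255], 'black': [0, 0, 0]}
text = 'The red door, the white wall, and a red rose.'

# 1. Parse into words (naive split; punctuation stripped from each word)
words = [w.strip(string.punctuation) for w in text.lower().split()]

# 2. Collect the vector for every word that names a color
vectors = [colors[w] for w in words if w in colors]

# 3. Average the vectors column by column
mean = [sum(col) / len(vectors) for col in zip(*vectors)]

# 4. Find the named color closest to the average (Euclidean distance)
def distance(v1, v2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

closest = min(colors, key=lambda name: distance(mean, colors[name]))

print(len(vectors), closest)  # 3 red
```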

In [0]:
# Import three classic texts
dracula_url = 'http://www.gutenberg.org/cache/epub/345/pg345.txt'
dickens_url = 'https://www.gutenberg.org/files/98/98-0.txt'  
yellow_url = 'http://www.gutenberg.org/files/1952/1952-0.txt'

In [87]:
# These are .txt files and not csv files
## We need to go through a few more steps to load these into Python

!wget {dracula_url}   # this brings the file into temporary storage in colab
!ls -l      # this code will check to see if the .txt file is in colab's temp storage

--2020-05-16 00:42:00--  http://www.gutenberg.org/cache/epub/345/pg345.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 883160 (862K) [text/plain]
Saving to: ‘pg345.txt’


2020-05-16 00:42:00 (1.92 MB/s) - ‘pg345.txt’ saved [883160/883160]

total 872
-rw-r--r-- 1 root root 883160 May  1 07:50 pg345.txt
drwxr-xr-x 1 root root   4096 May 13 16:29 sample_data
drwxr-xr-x 4 root root   4096 May 15 21:48 uo_puddles


In [0]:
# Now, read the dracula_url .txt file into a string

with open('pg345.txt', 'r') as f:
  dracula = f.read()

In [90]:
type(dracula)

str

In [91]:
dracula[:100]

'\ufeffThe Project Gutenberg EBook of Dracula, by Bram Stoker\n\nThis eBook is for the use of anyone anywher'

In [92]:
len(dracula)

867141

In [0]:
# Parse the entire book into words

doc = nlp(dracula.lower())

In [0]:
# spacy can also parse the book by sentences

drac_sentences = list(doc.sents)

In [97]:
for i in range(5):
  print(i, drac_sentences[i+60]) # print the five sentences starting with sentence 60

0 _)



1 _3 may.
2 bistritz._--left munich at 8:35 p. m.
3 , on 1st may, arriving at
vienna early next morning; should have arrived at 6:46, but train was an
hour late.
4 buda-pesth seems a wonderful place, from the glimpse which i
got of it from the train and the little i could walk through the
streets.


In [98]:
sentence64 = drac_sentences[64]
sentence64

buda-pesth seems a wonderful place, from the glimpse which i
got of it from the train and the little i could walk through the
streets.

In [99]:
for token in sentence64:
  print(token.text)

buda
-
pesth
seems
a
wonderful
place
,
from
the
glimpse
which
i


got
of
it
from
the
train
and
the
little
i
could
walk
through
the


streets
.


## Now, build a matrix of words from Dracula that are also colors

Match each word in the book against the names in the color_table index.

In [0]:
drac_color_matrix = []
color_names = color_table.index.tolist()

for token in doc:
  word = token.text
  if word in color_names:
    drac_color_matrix.append(color_table.loc[word].tolist()) # append the rgb values

In [101]:
len(drac_color_matrix) # 901 color mentions

901
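
Because the loop compares one token at a time, multi-word names in the index, such as 'fire engine red' or 'almost black', can never match as a whole; only their single-word parts can. Whether that matters depends on the text. The sketch below contrasts single-token matching with a greedy longest match over up to three tokens. The function `find_color_mentions` and the sample sentence are hypothetical, and a plain `split()` stands in for spacy's tokenizer.

```python
def find_color_mentions(tokens, names, max_words=3):
    # Hypothetical helper: greedy longest-match of multi-word names against a token list
    name_set = set(names)  # set membership is faster than scanning a list
    mentions = []
    i = 0
    while i < len(tokens):
        for n in range(max_words, 0, -1):  # try the longest phrase first
            window = tokens[i:i + n]
            if len(window) == n and ' '.join(window) in name_set:
                mentions.append(' '.join(window))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no name starts at this token
    return mentions

names = ['red', 'black', 'fire engine red', 'almost black']
tokens = 'the fire engine red truck was almost black'.split()

single = [t for t in tokens if t in set(names)]
print(single)                              # ['red', 'black']
print(find_color_mentions(tokens, names))  # ['fire engine red', 'almost black']
```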

In [102]:
drac_color_matrix[:10]

[[229, 0, 0],
 [244, 208, 84],
 [255, 255, 255],
 [255, 255, 255],
 [255, 255, 255],
 [172, 116, 52],
 [0, 0, 0],
 [0, 0, 0],
 [27, 36, 49],
 [78, 81, 139]]

In [104]:
avg_color = meanv(drac_color_matrix)
avg_color

[147.44839067702551, 113.65371809100999, 100.13540510543841]

In [105]:
ordered_embeddings(avg_color, color_table)[:10]

[[13.519858753013214, 'reddish grey'],
 [15.356247186381948, 'brownish grey'],
 [16.350106463486874, 'brownish'],
 [19.826822637698537, 'brown grey'],
 [21.824003657449868, 'mocha'],
 [26.730012587581818, 'grey brown'],
 [28.095953180857567, 'puce'],
 [28.286050911198767, 'dull brown'],
 [29.719493432987974, 'pinkish brown'],
 [31.643437130672552, 'dark taupe']]

## Do the same for *The Yellow Wallpaper*

In [106]:
!wget {yellow_url}
!ls -l

--2020-05-16 00:58:27--  http://www.gutenberg.org/files/1952/1952-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51186 (50K) [text/plain]
Saving to: ‘1952-0.txt’


2020-05-16 00:58:27 (755 KB/s) - ‘1952-0.txt’ saved [51186/51186]

total 924
-rw-r--r-- 1 root root  51186 Apr 18 14:40 1952-0.txt
-rw-r--r-- 1 root root 883160 May  1 07:50 pg345.txt
drwxr-xr-x 1 root root   4096 May 13 16:29 sample_data
drwxr-xr-x 4 root root   4096 May 15 21:48 uo_puddles


In [0]:
with open('1952-0.txt', 'r') as f:
  yellow = f.read()

In [0]:
len(yellow)

In [0]:
yellow[:100]

In [0]:
doc = nlp(yellow.lower())

In [0]:
yellow_color_matrix = []
color_names = color_table.index.tolist()

for token in doc:
  word = token.text
  if word in color_names:
    yellow_color_matrix.append(color_table.loc[word].tolist())

In [110]:
len(yellow_color_matrix) # 26 colors

26

In [111]:
avg_color = meanv(yellow_color_matrix)
avg_color

[192.0, 185.26923076923077, 48.23076923076923]

In [112]:
ordered_embeddings(avg_color, color_table)[:10]

[[32.867606937512456, 'pea'],
 [34.6139529618472, 'puke yellow'],
 [35.25580652163731, 'sick green'],
 [37.701902232548726, 'vomit yellow'],
 [39.06073635071867, 'booger'],
 [39.34720654280166, 'olive yellow'],
 [40.548768942125655, 'snot'],
 [41.77193998863593, 'gross green'],
 [42.05000193486195, 'dirty yellow'],
 [42.420635957392086, 'mustard yellow']]

Write a function called `build_embedding_matrix` that takes a string (e.g., a book) and a table (e.g., the color_table) as inputs and returns a matrix with one row of table values (e.g., the RGB values) for each word in the string that appears in the table's index.

In [0]:
def build_embedding_matrix(raw_text: str, table) -> list:
  assert isinstance(raw_text, str), f'raw_text should be string but instead is {type(raw_text)}'
  assert isinstance(table, pd.core.frame.DataFrame), f'table not a dataframe but instead a {type(table)}'
  assert 'nlp' in globals(), f'This function assumes that the spacy nlp function has been defined'

  matrix = []
  index_list = table.index.tolist()
  doc = nlp(raw_text.lower())

  # short version
  # matrix = [table.loc[token.text].tolist() for token in doc if token.text in index_list]

  for token in doc:
    word = token.text
    if word in index_list:
      matrix.append(table.loc[word].tolist())
  return matrix

In [0]:
# Test the build_embedding_matrix function on the Yellow Wallpaper

yellow_matrix = build_embedding_matrix(yellow, color_table)

In [123]:
# Check if we got the same results as above

yellow_matrix == yellow_color_matrix # True

True

Use `build_embedding_matrix` to build the matrix for calculating the average color of Dickens's *A Tale of Two Cities*.

In [128]:
!wget {dickens_url}
!ls -l

--2020-05-16 01:11:18--  https://www.gutenberg.org/files/98/98-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 804335 (785K) [text/plain]
Saving to: ‘98-0.txt’


2020-05-16 01:11:19 (1.97 MB/s) - ‘98-0.txt’ saved [804335/804335]

total 1712
-rw-r--r-- 1 root root  51186 Apr 18 14:40 1952-0.txt
-rw-r--r-- 1 root root 804335 Mar 19  2018 98-0.txt
-rw-r--r-- 1 root root 883160 May  1 07:50 pg345.txt
drwxr-xr-x 1 root root   4096 May 13 16:29 sample_data
drwxr-xr-x 4 root root   4096 May 15 21:48 uo_puddles


In [0]:
with open('98-0.txt', 'r') as f:
  dickens = f.read()

In [0]:
dickens_matrix = build_embedding_matrix(dickens, color_table)

In [0]:
avg_color = meanv(dickens_matrix)

In [132]:
ordered_embeddings(avg_color, color_table)[:10]

[[12.272728816373057, 'dark taupe'],
 [14.961457974171816, 'cocoa'],
 [16.458517047363525, 'greyish brown'],
 [17.920407343606808, 'dull brown'],
 [19.65895044742039, 'grey brown'],
 [21.21055110632527, 'dirt'],
 [24.630416790742295, 'dark mauve'],
 [24.877166111255395, 'dirt brown'],
 [29.25547441759158, 'brownish'],
 [29.602296007881918, 'brownish grey']]