continuing from [class 6](../class-6/nlp-demo.ipynb)

[spacy visualizers for sentence diagramming](https://spacy.io/usage/visualizers)

In [1]:
import spacy

In [2]:
import sys
# use the kernel's own interpreter so the model installs into this environment;
# the quotes keep a path containing spaces intact
!"{sys.executable}" -m spacy download en_core_web_md

Collecting en-core-web-md==3.2.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl (45.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [3]:
nlp = spacy.load('en_core_web_md')

In [4]:
doc = nlp(open("frankenstein.txt").read())

[penn pos list](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

In [14]:
nouns = [item.text for item in doc if item.tag_ == 'NN']
past_tense_verbs = [item.text for item in doc if item.tag_ == 'VBD']
adj = [item.text for item in doc if item.tag_ == 'JJ']
interjections = [item.text for item in doc if item.tag_ == 'UH' and len(item.text)>1]

In [28]:
# remove duplicates by creating a set from the list and then turning it back into a list
# (a set has no guaranteed order, so the order of the result is arbitrary)
dedup_nouns = list(set(nouns))

In [15]:
import random

In [16]:
'the ' + random.choice(nouns) + ' ' + random.choice(past_tense_verbs)

'the way advanced'

In [17]:
# filter on the lemma (the base form with morphology stripped) so that every form of 'to be' (was, were) is excluded
# this makes the output more interesting, because 'to be' adds little in the context of creative writing
past_tense_verbs = [item.text for item in doc if item.tag_ == 'VBD' and item.lemma_ != 'be']

In [18]:
adv = [item.text for item in doc if item.tag_ == 'RB']

In [19]:
import tracery 
from tracery.modifiers import base_english

In [27]:
rules = {
    'origin': ['#interjection.capitalize# #nounphrase# #verbphrase#', '#adverb# #nounphrase# #verbphrase#', '#nounphrase# #verbphrase# then #verbphrase#'],
    'nounphrase': ['the #noun#', '#noun.a#', 'the #adj# #noun#', 'the #noun# and the #noun#', 'the #noun# that #verbphrase#'],
    'verbphrase': ['#verb#', '#verb# #nounphrase#', '#verb# #adverb#', '#verb# #nounphrase# #adverb#'],
    'noun': nouns,
    'adj': adj,
    'verb': past_tense_verbs,
    'adverb': adv,
    'interjection': interjections
}
# build the grammar once, then flatten it repeatedly
grammar = tracery.Grammar(rules)
grammar.add_modifiers(base_english)
for i in range(5):
    print(grammar.flatten('#origin#'))

Alas the air presented a tale
Alas the human and the removal resolved
a remorse had then sounded
Oh the sun and the time beheld the fervour that said there
the impenetrable hope perceived a bed barbarously then entered
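
to see roughly what `flatten` is doing, here is a minimal sketch of recursive `#symbol#` expansion in plain python. this is not tracery's actual implementation: it ignores modifiers like `.a` and `.capitalize`, and `toy_rules` is a hypothetical hand-written grammar, not words pulled from `frankenstein.txt`.

```python
import random
import re

def flatten(rules, text):
    """Replace every #symbol# in text with a randomly chosen expansion, recursively."""
    def expand(match):
        # pick one expansion for the symbol, then expand any symbols inside it
        choice = random.choice(rules[match.group(1)])
        return flatten(rules, choice)
    return re.sub(r"#(\w+)#", expand, text)

# hypothetical toy grammar, same shape as the rules above
toy_rules = {
    "origin": ["#nounphrase# #verbphrase#"],
    "nounphrase": ["the #noun#", "the #adj# #noun#"],
    "verbphrase": ["#verb#", "#verb# #nounphrase#"],
    "noun": ["monster", "glacier"],
    "adj": ["wretched"],
    "verb": ["fled", "beheld"],
}
print(flatten(toy_rules, "#origin#"))
```

each call picks expansions at random, so the printed sentence changes from run to run.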


# word vectors

[notes from allison](https://github.com/aparrish/rwet/blob/master/understanding-word-vectors.ipynb)

1. the first step in a machine learning process is breaking something down into its core attributes
2. plot the items using those attributes as coordinates
3. calculate the distances between items in that attribute space
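
the steps above can be sketched with a tiny made-up dataset (plotting is skipped here). the `animals` dict, its two attributes (cuteness and size, on a 0-100 scale), and the `closest` helper are all assumptions for illustration, not data from the class.

```python
import math

# hypothetical items, each broken down into two made-up attributes:
# (cuteness, size) on a 0-100 scale
animals = {
    "kitten": (95, 15),
    "hamster": (80, 8),
    "horse": (50, 90),
}

def distance(a, b):
    """Euclidean distance between two attribute tuples."""
    return math.dist(a, b)

def closest(name):
    """Return the other item whose attributes are nearest to `name`."""
    others = [n for n in animals if n != name]
    return min(others, key=lambda n: distance(animals[name], animals[n]))

print(closest("kitten"))  # hamster
```

items that end up close together in attribute space are the ones we would call similar.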

In [34]:
# colors as vectors
import json
import numpy as np
color_data = json.loads(open("xkcd.json").read())

In [32]:
# hex is base 16; the second argument to int() tells it to parse the string as base 16
def hex_to_int(s):
    s = s.lstrip("#")
    return np.array([int(s[:2], 16), int(s[2:4], 16), int(s[4:6], 16)])

In [35]:
colors = dict()
for item in color_data['colors']:
    colors[item["color"]] = hex_to_int(item["hex"])

In [36]:
colors['olive']

array([110, 117,  14])

"numpy" - library for doing vector math

In [37]:
np.array([4,5]) + np.array([1,1])

array([5, 6])

In [38]:
np.array([4,5]) * 2

array([ 8, 10])

## vector math with colors

In [39]:
from numpy.linalg import norm
def distance(a,b):
    return norm(a-b)

In [40]:
# formula works for as many dimensions as needed
distance(colors['red'], colors['green'])

273.70787347096905
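
for reference, `norm(a - b)` on two vectors is the Euclidean distance: the square root of the sum of squared differences. a plain-python sketch of the same calculation follows. the `euclidean` helper and the example vectors are made up for illustration.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences, one term per dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# three dimensions, like an RGB color
print(euclidean((3, 4, 0), (0, 0, 0)))        # 5.0
# the same function works unchanged in four dimensions
print(euclidean((1, 1, 1, 1), (0, 0, 0, 0)))  # 2.0
```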

In [41]:
distance(colors['red'], colors['pink'])

232.76812496559748

In [44]:
(colors['cyan'] + colors['blue']) / 2

array([  1.5, 161. , 239. ])

## find closest item

approximate nearest neighbors
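
for comparison, an exact nearest-neighbor search just measures the distance from the query to every item and sorts. that is fine for a small list of colors but slow for large collections, which is the cost an approximate index is meant to avoid. a minimal brute-force sketch, where the `nearest` helper and the five-color `palette` are hypothetical and not the xkcd data:

```python
import math

def nearest(query, items, n=3):
    """Exact (brute-force) search: sort every item by distance to the query."""
    return sorted(items, key=lambda name: math.dist(query, items[name]))[:n]

# hypothetical mini palette of RGB triples
palette = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "pure red": (255, 0, 0),
    "pure green": (0, 255, 0),
    "pure blue": (0, 0, 255),
}
print(nearest((200, 30, 30), palette, n=2))  # ['pure red', 'black']
```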

In [47]:
import sys
# install into the environment the notebook kernel is running in
!"{sys.executable}" -m pip install simpleneighbors

Collecting simpleneighbors
  Downloading simpleneighbors-0.1.0-py2.py3-none-any.whl (12 kB)
Installing collected packages: simpleneighbors
Successfully installed simpleneighbors-0.1.0


In [46]:
import sys
# install into the environment the notebook kernel is running in
!"{sys.executable}" -m pip install annoy==1.16.3

Collecting annoy==1.16.3
  Downloading annoy-1.16.3.tar.gz (644 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: annoy
  Building wheel for annoy (setup.py) ... done
  Created wheel for annoy: filename=annoy-1.16.3-cp38-cp38-macosx_10_9_x86_64.whl size=68931 sha256=57c3d8a325a67f727625f9330f938d5da2083a6aa6d46d3d9374dc4637d13a4a
  Stored in directory: /Users/samheckle/Library/Caches/pip/wheels/93/66/00/3527630e17462dcb505b4688f787b40bc020268237d54e5e79
Successfully built annoy
Installing collected packages: annoy
Successfully installed annoy-1.16.3


In [48]:
from simpleneighbors import SimpleNeighbors

In [49]:
color_lookup = SimpleNeighbors(3, 'euclidean')
for name, vec in colors.items():
    color_lookup.add_one(name, vec)
color_lookup.build()

In [50]:
color_lookup.nearest(colors['red'])

['red',
 'fire engine red',
 'bright red',
 'tomato red',
 'cherry red',
 'scarlet',
 'vermillion',
 'orangish red',
 'cherry',
 'lipstick red',
 'darkish red',
 'neon red']

In [51]:
'red' in color_lookup.corpus

True

## color magic

In [53]:
color_lookup.nearest(colors['purple'] - colors['red'])

['cobalt blue',
 'royal blue',
 'darkish blue',
 'true blue',
 'royal',
 'prussian blue',
 'dark royal blue',
 'deep blue',
 'marine blue',
 'deep sea blue',
 'darkblue',
 'twilight blue']

## interlude: a love poem that loses its way

In [54]:
import random
red = colors['red']
blue = colors['blue']
for i in range(14):
    rednames = color_lookup.nearest(red)
    bluenames = color_lookup.nearest(blue)
    print("Roses are " + rednames[0] + ", violets are " + bluenames[0])
    red = colors[random.choice(rednames[1:])]
    blue = colors[random.choice(bluenames[1:])]

Roses are red, violets are blue
Roses are bright red, violets are vivid blue
Roses are neon red, violets are primary blue
Roses are cherry, violets are ultramarine
Roses are deep pink, violets are royal blue
Roses are red violet, violets are royal
Roses are berry, violets are deep blue
Roses are mulberry, violets are navy blue
Roses are purple red, violets are dark
Roses are red wine, violets are very dark purple
Roses are berry, violets are eggplant
Roses are red purple, violets are midnight purple
Roses are purple red, violets are deep violet
Roses are dark magenta, violets are eggplant purple


# word vectors for realsies

distributional semantics = 'linguistic items with similar distributions have similar meanings'

e.g.

```
It was really cold yesterday.
It will be really warm today, though.
It'll be really hot tomorrow!
Will it be really cool Tuesday?
```
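
a rough sketch of the idea using the four example sentences: record which word comes immediately before each word, then look for words that share the same context. real word vectors are built from far larger corpora and wider context windows. the `left_context` table and the one-word window are simplifying assumptions for illustration.

```python
import re
from collections import defaultdict

sentences = [
    "It was really cold yesterday.",
    "It will be really warm today, though.",
    "It'll be really hot tomorrow!",
    "Will it be really cool Tuesday?",
]

# map each word to the set of words that appear immediately before it
left_context = defaultdict(set)
for sentence in sentences:
    words = re.findall(r"[a-z']+", sentence.lower())
    for prev, word in zip(words, words[1:]):
        left_context[word].add(prev)

# words that only ever follow 'really' have the same distribution in this tiny corpus
shared = sorted(w for w, ctx in left_context.items() if ctx == {"really"})
print(shared)  # ['cold', 'cool', 'hot', 'warm']
```

the four temperature words land in the same group purely because of where they appear, with no knowledge of what they mean.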

[GloVe](https://nlp.stanford.edu/projects/glove/)

In [55]:
import spacy

In [56]:
nlp = spacy.load('en_core_web_md')

In [None]:
nlp.vocab['kitten'].vector

In [58]:
def vec(s):
    return nlp.vocab[s].vector

In [59]:
!curl -L -O https://raw.githubusercontent.com/aparrish/wordfreq-en-25000/main/wordfreq-en-25000-log.json

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  898k  100  898k    0     0  3105k      0 --:--:-- --:--:-- --:--:-- 3231k


In [60]:
import json
prob_lookup = dict(json.load(open("./wordfreq-en-25000-log.json")))

In [61]:
prob_lookup['me']

-5.7108

In [62]:
import math
math.exp(prob_lookup['me'])

0.0033100234666365628

In [63]:
lookup = SimpleNeighbors(300)
for word in prob_lookup.keys():
    if nlp.vocab[word].has_vector:
        lookup.add_one(word, vec(word))

In [64]:
lookup.build()

In [65]:
lookup.nearest(vec('basketball'))

['basketball',
 'volleyball',
 'lacrosse',
 'football',
 'soccer',
 'baseball',
 'softball',
 'hockey',
 'tennis',
 'racket',
 'badminton',
 'athletic']