# SI 618: Data Manipulation and Analysis
## 07 - Natural Language Processing
### Dr. Chris Teplovs, School of Information, University of Michigan
<small><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a> This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</small>

# Outline for today
- announcements
  - Teplovs Office Hours this week: 3:45-5:00pm
  - no HTML,  no grade
- regular expressions redux
- ```spaCy```
    - Cleaning the data
    - Extracting linguistic features
- ```Word2Vec```
    - Vector representation of words
    - Word similarities
    - Vector algebra for semantics

# Why learn NLP?
- Natural language = human language
- We use language to learn about the world
- How do machines understand human language?
- How can we quantify the meaning of language?

## Applications?
- Probably any service that uses text as information
- Search engine, SNS
    - What's the document about?
    - How do you determine the similarity?
- Virtual assistants: Alexa, Google Assistant, Cortana, etc. 
    - Understand the semantic information from your speech from parsed text
- Biology, genetics
    - Genetic information / DNA sequence as text
    - Draw networks of proteins/molecules from a vast number of scientific papers

# Regular Expressions

**Note:** we have included a [cheat sheet for regular expressions](python-regular-expressions-cheat-sheet.pdf).

Regular expressions are simply a way to find sequences of characters within strings.  Let's use, as our text, "Bereft", a poem by Robert Frost:

In [None]:
bereft = """Where had I heard this wind before
Change like this to a deeper roar?
What would it take my standing there for,
Holding open a restive door,
Looking down hill to a frothy shore?
Summer was past and the day was past.
Sombre clouds in the west were massed.
Out on the porch's sagging floor,
Leaves got up in a coil and hissed,
Blindly striking at my knee and missed.
Something sinister in the tone
Told me my secret must be known:
Word I was in the house alone
Somehow must have gotten abroad,
Word I was in my life alone,
Word I had no one left but God."""

To make our lives simpler (we'll discuss why this is the case in class), we're going to strip the newlines from the passage:

In [None]:
bereft = bereft.replace('\n',' ')

In [None]:
bereft

Let's say we wanted to find all the occurrences of the word "alone".   We could use  plain old string functions:

In [None]:
bereft.find('alone')

In [None]:
if 'alone' in bereft:
    print('Yup, found it')

In [None]:
bereft.count('alone')

Ok, now try it yourself:  
### <font color="magenta">Q1: How many times does the word ```was``` appear in the poem?</font>

In [None]:
# insert your code here

So far, so good.  Now let's make things a bit more interesting.  How many words are there that contain the letters ```one``` ?

In [None]:
bereft.count('one')

In [None]:
import re

In [None]:
re.findall('one',bereft)

But what if we wanted to know the words that contained ```one``` instead of just the count?  Enter regular expressions!

In [None]:
re.findall('[a-z]+one',bereft)
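The quantifier matters here. The sketch below, on a short made-up sample string rather than the full poem, compares `+` (one or more preceding letters) with `*` (zero or more), and adds word boundaries (`\b`) to capture whole words that contain `one` anywhere.

```python
import re

# Hypothetical sample text; any string behaves the same way.
sample = "Word I was in the house alone, no one left, stones and money"

# '[a-z]+one' needs at least one letter before 'one', so the bare word 'one' is skipped.
plus_matches = re.findall(r'[a-z]+one', sample)

# '[a-z]*one' allows zero letters before 'one', so 'one' itself also matches.
star_matches = re.findall(r'[a-z]*one', sample)

# Word boundaries (\b) capture the whole word, including letters after 'one'.
whole_words = re.findall(r'\b\w*one\w*\b', sample)

print(plus_matches)
print(star_matches)
print(whole_words)
```

Note that the first two patterns stop right after `one`, so `stones` is reported as `stone` and `money` as `mone`; only the word-boundary version returns complete words.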

### Some useful online resources:

* www.debuggex.com
* www.regexr.com

In [None]:
re.search('[a-z]*one',bereft)

In [None]:
match = re.search('[a-z]*one', bereft)

In [None]:
if match:
    print("Found it!")

In [None]:
if match:
    print("Found it!")
    print(match.group(0))

### Match Groups

In the above example, we used ```match.group(0)``` to extract the entire match.

Match groups also allow you to extract only certain parts of the match.  In the previous example, say we wanted to know which letters preceded the letters 'one'.  We could use match groups, specified by parentheses, to extract only those parts.

In [None]:
match = re.search('([a-z]*)one', bereft)

In [None]:
if match:
    print("Found it!")
    print(match.group(0))
    print(match.group(1))

How would we extract all the letters that precede *one*?  Use ```re.finditer()```

In [None]:
matches = re.finditer('([a-z]*)one', bereft)

In [None]:
for match in matches:
    print(match.group(0),match.group(1))
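Each match object also records where the match occurred. The following sketch, using a made-up sample line and a hypothetical variable name `found`, collects the start offset, the captured prefix, and the full match for every hit.

```python
import re

# Hypothetical sample line, used only for illustration.
line = "Word I was in the house alone, no one left"

# For every match keep: start offset, group 1 (letters before 'one'), group 0 (whole match).
found = [(m.start(), m.group(1), m.group(0))
         for m in re.finditer(r'([a-z]*)one', line)]

for start, prefix, whole in found:
    print(start, repr(prefix), whole)
```

An empty prefix (`''`) means that `one` appeared with no lowercase letters directly before it.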

In [None]:
re.split(',',bereft)

### <font color="magenta">Q2: Experiment with various regular expressions such as \W, \w, \s, \S to see how the poem can be split.</font>

In [None]:
#Insert your code here

## How about a few rounds of regex golf?

### <font color="magenta">Q3: See how well you can do at your tables: https://alf.nu/RegexGolf</font>
Record your final score below



## Applying regex to pandas DataFrames (from last class) 

As usual, let's load up some data:

In [None]:
import pandas as pd

In [None]:
reviews = pd.read_csv('data/amazon_food_reviews.zip')

Let's take a really small sample, just so we can experiment:

In [None]:
reviews_sample = reviews.head(10)

In [None]:
reviews_sample

Let's review some basic string functionality from Pandas that can be applied to any Series or Index:

In [None]:
reviews_sample.ProfileName.str.lower()

In [None]:
reviews_sample.ProfileName.str.upper()

In [None]:
reviews_sample.Summary.str.len()

Remember, the ```columns``` attribute of a DataFrame is an Index object, which means that we can use str operators on the column names:

In [None]:
reviews_sample.columns

In [None]:
reviews_sample.columns.str.lower()

Notice that the "User Id" column of the dataframe looks weird:  it has a space in the middle *and* at the end.  Columns that are named like that will invariably trip us up in downstream (i.e. later) analyses, so it's wise to correct them now.  Something like the following can help:

In [None]:
reviews_sample.columns.str.strip().str.lower().str.replace(' ','_')

And we can assign that back to the columns attribute to actually rename the columns:


In [None]:
reviews_sample.columns = reviews_sample.columns.str.strip().str.lower().str.replace(' ','_')

In [None]:
reviews_sample
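The same clean-up chain can be sketched on a plain Python list, which makes each step easy to inspect. The column names below are hypothetical stand-ins, not the actual columns of the reviews file, and `clean_name` is a helper invented for this illustration.

```python
# Hypothetical column names with stray spaces and mixed case.
raw_columns = ["Id", "ProductId", "User Id ", " Profile Name"]

def clean_name(name):
    """Strip outer whitespace, lowercase, and replace inner spaces with underscores."""
    return name.strip().lower().replace(" ", "_")

clean_columns = [clean_name(name) for name in raw_columns]
print(clean_columns)
```

The order of the steps matters: stripping first means a trailing space is removed rather than turned into a trailing underscore.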

### Splitting and Replacing Strings

Sometimes, we want to split strings into lists.  We might want to do that with the "productid" column:

In [None]:
reviews_sample.productid.str.split('00')

In [None]:
reviews_sample.productid.str.split('00').str.get(1)

Equivalently:

In [None]:
reviews_sample.productid.str.split('00').str[1]

### Replace (regex time!)

In [None]:
reviews_sample.summary.str.lower().str.replace('dog','health')

In [None]:
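# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
reviews_sample.summary.str.lower().str.replace('dog|taffy','health', regex=True)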

### Extracting Substrings

In [None]:
reviews_sample.summary.str.extract(r'(Dog)')

In [None]:
reviews_sample.summary.str.extract(r'(Dog|Taffy)')

In [None]:
reviews_sample.summary.str.extract(r'(Dog|[Tt]affy)')

In [None]:
# returns a Series
reviews_sample.summary.str.extract(r'(Dog|[Tt]affy)', expand = False)

In [None]:
reviews_sample.summary.str.extractall(r'(Dog|[Tt]affy)')

In [None]:
reviews_sample.summary.str.extractall(r'(as)')
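As a rough mental model, `str.extract` keeps only the first match in each string, while `str.extractall` keeps every match. The plain-`re` sketch below mimics that difference on a few made-up summaries. It illustrates the idea only and is not how pandas implements these methods; `toy_pattern` and `first_match` are names invented for this example.

```python
import re

# Hypothetical summaries, used only for illustration.
summaries = ["Great taffy", "Dog food my Dog loves", "Cough medicine"]
toy_pattern = re.compile(r'(Dog|[Tt]affy)')

def first_match(text):
    """Return the first captured group, or None when nothing matches (like extract)."""
    m = toy_pattern.search(text)
    return m.group(1) if m else None

first_only = [first_match(s) for s in summaries]
all_matches = [toy_pattern.findall(s) for s in summaries]   # like extractall

print(first_only)
print(all_matches)
```

The second summary shows the difference: one value from the first approach, two from the second.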

### Testing for Strings that Match or Contain a Pattern

In [None]:
reviews_sample.text

In [None]:
pattern = r'[Gg]ood'

In [None]:
reviews_sample.text.str.contains(pattern)

In [None]:
reviews_sample.text.str.match(pattern)

In [None]:
pattern = r'.*[Gg]ood.*'

In [None]:
reviews_sample.text.str.match(pattern)
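The difference between the two methods mirrors the difference between `re.search` and `re.match` in the standard library: `str.contains` looks for the pattern anywhere in the string, while `str.match` only succeeds when the match begins at the start of the string. A minimal sketch on made-up review texts (the strings and variable names are hypothetical):

```python
import re

# Hypothetical review texts, used only for illustration.
texts = ["Good quality dog food", "This taffy is so good", "Not what I ordered"]
toy_pattern = r'[Gg]ood'

found_anywhere = [bool(re.search(toy_pattern, t)) for t in texts]   # like str.contains
found_at_start = [bool(re.match(toy_pattern, t)) for t in texts]    # like str.match

# Padding the pattern with '.*' lets a start-anchored match behave like a search.
padded_at_start = [bool(re.match(r'.*[Gg]ood.*', t)) for t in texts]

print(found_anywhere, found_at_start, padded_at_start)
```

This is why the second pattern above, `.*[Gg]ood.*`, makes `str.match` agree with `str.contains`.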

#### Helpful resources:
- Pandas text documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
- Regex Cheat Sheet: https://regexr.com/

### <font color="magenta">Q4: How many rows from the Amazon Food Reviews data set contain HTML tags in the ```text``` column?</font>

In [None]:
# Add your code here

### <font color="magenta">Q5: Remove all HTML tags from the Amazon Food Reviews text column and save the results to a column called text_no_html.</font>

In [None]:
# Add your code here

## NOTE: Install the spaCy and gensim libraries now.  Windows users will need to implement some work-arounds to get spaCy to work properly.

# spaCy

In [None]:
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

- Fast and extensible NLP package for Python
- <https://spacy.io/>

In [None]:
import spacy

In [None]:
# You will need to do this only once
# (the 'en' shortcut was removed in spaCy 3, so we use the full model name)
# ! python -m spacy download en_core_web_sm

In [None]:
# loading up the language model: English
# Windows users will need to follow the instructions in Slack to modify the next line
nlp = spacy.load('en_core_web_sm')

# Cleaning Text Data

In [None]:
# from https://en.wikipedia.org/wiki/Portal:History
sentences = """History (from Greek ἱστορία, historia, meaning "inquiry, knowledge acquired by investigation") is the study of the past as it is described in written documents. Events occurring before written record are considered prehistory. It is an umbrella term that relates to past events as well as the memory, discovery, collection, organization, presentation, and interpretation of information about these events. Scholars who write about history are called historians.

History can also refer to the academic discipline which uses a narrative to examine and analyse a sequence of past events, and objectively determine the patterns of cause and effect that determine them. Historians sometimes debate the nature of history and its usefulness by discussing the study of the discipline as an end in itself and as a way of providing "perspective" on the problems of the present.

Stories common to a particular culture, but not supported by external sources (such as the tales surrounding King Arthur), are usually classified as cultural heritage or legends, because they do not show the "disinterested investigation" required of the discipline of history. Herodotus, a 5th-century BC Greek historian is considered within the Western tradition to be the "father of history", and, along with his contemporary Thucydides, helped form the foundations for the modern study of human history. Their works continue to be read today, and the gap between the culture-focused Herodotus and the military-focused Thucydides remains a point of contention or approach in modern historical writing. In East Asia, a state chronicle, the Spring and Autumn Annals was known to be compiled from as early as 722 BC although only 2nd-century BC texts survived.

Ancient influences have helped spawn variant interpretations of the nature of history which have evolved over the centuries and continue to change today. The modern study of history is wide-ranging, and includes the study of specific regions and the study of certain topical or thematical elements of historical investigation. Often history is taught as part of primary and secondary education, and the academic study of history is a major discipline in university studies."""

### Section goal: calculate the frequency of each word
- See which words are more frequent.
- Generate a more meaningful summary of the above paragraph.

## Lowering the case

In [None]:
type(sentences)

In [None]:
sentences

In [None]:
sent_low = sentences.lower()

In [None]:
sent_low

## Removing punctuation and special characters

#### Exclude special characters one by one

In [None]:
# from https://www.programiz.com/python-programming/examples/remove-punctuation
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~''' # string of special characters you want to exclude (raw string, so the backslash is kept literally)
sent_low_pnct = ""
for char in sent_low:
    if char not in punctuations:
        sent_low_pnct = sent_low_pnct + char

sent_low_pnct

#### Alternatively, we can use a regular expression to remove punctuation:
- So we don't have to list all possible special characters that we want to remove
- https://docs.python.org/3.4/library/re.html
- https://en.wikipedia.org/wiki/Regular_expression

In [None]:
import re
sent_low_pnct2 = re.sub(r'[^\w\s]', '', sent_low)

In [None]:
sent_low_pnct2
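The character class `[^\w\s]` means "anything that is neither a word character nor whitespace". The sketch below, on a made-up sample string, shows two side effects worth knowing: underscores and digits count as word characters and survive, and hyphenated words are fused together.

```python
import re

# Hypothetical sample string, used only for illustration.
sample = 'Well-known, 5th-century "father of history" & snake_case!'

# Delete every character that is not a word character (\w) or whitespace (\s).
cleaned = re.sub(r'[^\w\s]', '', sample)

print(repr(cleaned))
```

Removing the `&` leaves two consecutive spaces behind; that is harmless if the text is later split with `str.split()`, which treats any run of whitespace as a single separator.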

- However, the special character ```\n``` (line break) still exists in both cases. Let's remove it as well.

In [None]:
import os
os.linesep

In [None]:
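# Replace "\n" directly: os.linesep is "\r\n" on Windows and would not match.
# Use a space so that words on either side of a line break are not fused together.
sent_low_pnct = sent_low_pnct.replace("\n", " ")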
sent_low_pnct

### $\rightarrow$ 3 possible ways to replace characters!

### <font color='magenta'> Q6. How would you remove numbers from the paragraph? </font>

In [None]:
# put your code here

## Removing stop words

- Stop words usually refer to the most common words in a language
    - There is no single universal list of stop words
    - Stop words are often removed to improve the performance of NLP models
    - https://en.wikipedia.org/wiki/Stop_words
    - https://en.wikipedia.org/wiki/Most_common_words_in_English

#### Import the list of stop words from ```spaCy```

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

In [None]:
# STOP_WORDS is a set; sort it first so we get a regular array of strings
np.array(sorted(STOP_WORDS))

#### Goal: We are going to count the frequency of each word from the paragraph, to see which words can be used to represent the paragraph's content. 

#### What if we do not remove stopwords?

In [None]:
from collections import Counter

- Note that our paragraph is stored as a single string object...

In [None]:
sent_low_pnct

- Split the paragraph into a list of words

In [None]:
words = sent_low_pnct.split()

- Count the words from the list
- Words that can occur in any kind of paragraphs...?

In [None]:
Counter(words).most_common(10)

In [None]:
plt.figure(figsize=(45,10))
sns.countplot(x=words, order=pd.Series(words).value_counts().index)
# sns.countplot(x=words, order=[counted[0] for counted in Counter(words).most_common()])
plt.xticks(rotation=90)
plt.show()

(double click the plot to enlarge)

#### When we removed stopwords:

In [None]:
# split sentence into words
words_nostop = list()
for word in words:
    if word not in STOP_WORDS:
        words_nostop.append(word)
# words_nostop = [word for word in words if word not in STOP_WORDS]

- A more comprehensible and distinctive list of words!

In [None]:
Counter(words_nostop).most_common(10)

In [None]:
plt.figure(figsize=(45,10))
sns.countplot(x=words_nostop, order=pd.Series(words_nostop).value_counts().index)
# sns.countplot(x=words_nostop, order=[counted[0] for counted in Counter(words_nostop).most_common()])
plt.xticks(rotation=90)
plt.show()

(double click the plot to enlarge)
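The cleaning steps in this section (lowercase, strip punctuation, split, drop stop words, count) can be bundled into one small function. This is a minimal sketch using only the standard library; `top_words` and `TOY_STOP_WORDS` are names invented here, and the tiny stop-word set is a stand-in for spaCy's much larger `STOP_WORDS`.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word set for illustration; spaCy's STOP_WORDS is far larger.
TOY_STOP_WORDS = {"the", "of", "is", "a", "and", "to", "in"}

def top_words(text, stop_words, n=3):
    """Lowercase, strip punctuation, split on whitespace, drop stop words, count."""
    cleaned = re.sub(r'[^\w\s]', '', text.lower())
    words = [w for w in cleaned.split() if w not in stop_words]
    return Counter(words).most_common(n)

# Hypothetical two-sentence text, used only for illustration.
toy_text = "History is the study of the past. The study of history is a discipline."
print(top_words(toy_text, TOY_STOP_WORDS))
```

Passing the stop-word set as an argument makes it easy to compare the results with and without stop-word removal: call the function once with the set and once with an empty set.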

### <font color='magenta'> Q7. Based on the word frequency results, what was the paragraph about? </font>

(type in your response here)

# Extracting linguistic features from spaCy

## Tokenize
- Token: a semantic unit for analysis
    - (Loosely) equivalent to a word
        - ```sent_low_pnct.split()```
    - Tricky cases
        - aren't $\rightarrow$ ![](https://nlp.stanford.edu/IR-book/html/htmledition/img88.png) ![](https://nlp.stanford.edu/IR-book/html/htmledition/img89.png) ? ![](https://nlp.stanford.edu/IR-book/html/htmledition/img86.png) ?
        - O'Neil $\rightarrow$ ![](https://nlp.stanford.edu/IR-book/html/htmledition/img83.png) ? ![](https://nlp.stanford.edu/IR-book/html/htmledition/img84.png) ![](https://nlp.stanford.edu/IR-book/html/htmledition/img81.png) ?
        - https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html
- In ```spaCy```:
    - Many token types, like word, punctuation symbol, whitespace, etc.

### Let's dissect the sentence!

- Initializing the ```spaCy``` object

In [None]:
# examples partially taken from https://nlpforhackers.io/complete-guide-to-spacy/
import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut was removed in spaCy 3

- Our sentence: "Hello World!"
    - Pass the sentence string to the ```spaCy``` object ```nlp```

In [None]:
doc = nlp("Hello World!")

- The sentence is considered as a short document.

In [None]:
print(type(doc), doc)

- When we passed the sentence string above, ```spaCy``` split the sentence into tokens (tokenization!)

In [None]:
for i,token in enumerate(doc):
    print(i, token)

- With index information (the character offset within the sentence) for each token

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10| 11|
|---|---|---|---|---|---|---|---|---|---|---|---|
| H | e | l | l | o | _ | W | o | r | l | d | ! |

In [None]:
for i, token in enumerate(doc):
    print(i, token.text, token.idx) 
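To see what a character offset means without loading a language model, here is a rough regex-based approximation. It is not spaCy's tokenizer (which handles many more cases); it only shows how each token's start position lines up with the character table above.

```python
import re

text = "Hello World!"

# Rough tokenizer: a run of word characters, or a single non-space punctuation mark.
# This is a simplification for illustration, not spaCy's tokenization rules.
tokens = [(m.group(0), m.start()) for m in re.finditer(r'\w+|[^\w\s]', text)]

for tok, offset in tokens:
    # Slicing the original text at the offset recovers the token.
    print(tok, offset, text[offset:offset + len(tok)])
```

Because an offset points back into the original string, it can be used to highlight or extract the exact span that a token came from.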


- And many more!
    - https://spacy.io/api/token#attributes

In [None]:
doc = nlp("What did you do during the study break   ?")

print("text \t idx \t lemma \t lower \t is_punct \t is_space \t shape \t POS")
for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.lower_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_
    ))


## Sentence detection

- For a document with multiple sentences, we need to separate the sentences from one another.
- In ```spaCy```, the job is more convenient (and causes fewer mistakes) than using regular expressions

### <font color='magenta'> Q8. How would you separate sentences? What's your intuition? </font>

(type in your response here)

- Our multiple sentence document: 

In [None]:
doc_multsent = "These are apples. Those are oranges from N.Y.C. and...? How about pineapples? Not carrots!!!"

- in regular expression...

In [None]:
import re
sentences = re.split(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s", doc_multsent) # how would I remember this pattern without Google/StackOverflow?
for i, sent in enumerate(sentences):
    print(i, sent)

- in ```spaCy```!

In [None]:
# same document, but initiate as the spaCy object...
doc = nlp(doc_multsent)

- Sentences are stored as a generator object
    - Instead of storing sentences as a list, each sentence is stored as an item in the generator object
    - Iterable (i.e., can be used in a for loop)
    - More efficient memory use
    - https://wiki.python.org/moin/Generators
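A short sketch of how generators behave, independent of spaCy. The function `numbered_sentences` is a hypothetical example: it produces items one at a time, and once consumed it yields nothing more.

```python
def numbered_sentences(sentences):
    """Yield one labelled sentence at a time instead of building a full list."""
    for i, sentence in enumerate(sentences):
        yield f"{i}: {sentence}"

gen = numbered_sentences(["These are apples.", "How about pineapples?"])

first = next(gen)    # produces only the first item
rest = list(gen)     # consumes whatever is left
again = list(gen)    # the generator is now exhausted

print(first, rest, again)
```

A generator cannot be indexed, so when you need positions such as "the third sentence", wrapping it in `list(...)` first (for example `list(doc.sents)`) is a common approach.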

In [None]:
doc.sents

- Printing sentences with the index number

In [None]:
for i, sent in enumerate(doc.sents):
    print(i, sent)

### <font color='magenta'> Q9. Separate sentences in the following paragraph, and print sentences with the index number. </font>

In [None]:
# from https://en.wikipedia.org/wiki/Portal:History
sentences = """History (from Greek ἱστορία, historia, meaning "inquiry, knowledge acquired by investigation") is the study of the past as it is described in written documents. Events occurring before written record are considered prehistory. It is an umbrella term that relates to past events as well as the memory, discovery, collection, organization, presentation, and interpretation of information about these events. Scholars who write about history are called historians.

History can also refer to the academic discipline which uses a narrative to examine and analyse a sequence of past events, and objectively determine the patterns of cause and effect that determine them. Historians sometimes debate the nature of history and its usefulness by discussing the study of the discipline as an end in itself and as a way of providing "perspective" on the problems of the present.

Stories common to a particular culture, but not supported by external sources (such as the tales surrounding King Arthur), are usually classified as cultural heritage or legends, because they do not show the "disinterested investigation" required of the discipline of history. Herodotus, a 5th-century BC Greek historian is considered within the Western tradition to be the "father of history", and, along with his contemporary Thucydides, helped form the foundations for the modern study of human history. Their works continue to be read today, and the gap between the culture-focused Herodotus and the military-focused Thucydides remains a point of contention or approach in modern historical writing. In East Asia, a state chronicle, the Spring and Autumn Annals was known to be compiled from as early as 722 BC although only 2nd-century BC texts survived.

Ancient influences have helped spawn variant interpretations of the nature of history which have evolved over the centuries and continue to change today. The modern study of history is wide-ranging, and includes the study of specific regions and the study of certain topical or thematical elements of historical investigation. Often history is taught as part of primary and secondary education, and the academic study of history is a major discipline in university studies."""

In [None]:
# put your code here

## POS tagging

- I want to find words with a particular part of speech!
- Different part-of-speech words carry different information
    - e.g., noun (subject), verb (action term), adjective (quality of the object) 
- https://spacy.io/api/annotation#pos-tagging

- Yelp review!

In [None]:
# from https://www.yelp.com/biz/ajishin-novi?hrid=juA4Zn2TX7845vNFn4syBQ&utm_campaign=www_review_share_popup&utm_medium=copy_link&utm_source=(direct)
doc = nlp("""One of the best Japanese restaurants in Novi. Simple food, great taste, amazingly price. I visit this place a least twice month.""")

### <font color='magenta'> Q10a. What can you infer from this review? </font>
- What type of restaurant?
- Location?
- What did the reviewer like?
- How often did the person visit the place?
- Any other information?

(type in your response here)

- multiple sentences exist in a document

In [None]:
for i, sent in enumerate(doc.sents):
    print(i, sent)

- Question: which words are adjectives (ADJ)?

In [None]:
for i, sent in enumerate(doc.sents):
    print("__sentence__:", i)
    print("_token_ \t _POS_")
    for token in sent:
        print(token.text, "\t", token.pos_)

## Grammatical dependency
- Words are grammatically related in a sentence.
- Conveys much semantic information about the sentential context.

- Dependency relationships can also be extracted as strings

In [None]:
for token in doc:
    print('"' + token.text + '", ', token.pos_, list(token.ancestors), (token.dep_))

```spaCy``` follows the ```ClearNLP``` annotations for dependency parsing
- https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md

### <font color='magenta'> Q10b: (Now answer the same question based on the POS/dependency parsing results) What can you infer from this review? </font>
- What type of restaurant?
- Location?
- What did the reviewer like?
- How often did the person visit the place?
- Any other information?

(type in your response here)

# Word embedding

- So far, we have seen how to extract some interesting syntactic characteristics from text using ```spacy```
- It extracted the characteristics, but did not indicate what they mean
- Can machines understand semantic relationships between words?

- Distributional semantics
    - Representing semantic information of words in a geometric semantic space
        - Different relationship between words: explained by geometric relationship between words 
        - e.g., Related words are located closer to each other; 
    - This is often called *word embedding*

#### Word2Vec
- Developed by [Mikolov et al., 2013](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
- Represent the meaning of the words as a vector
    - Vector: numeric array
    - Output of a neural network model that predicts the next word
- Surprisingly, many different kinds of semantic information can be represented by the word vectors of ```Word2Vec```
- (More explanation here: https://www.tensorflow.org/tutorials/representation/word2vec)

<img src="https://www.tensorflow.org/images/softmax-nplm.png" width="400">

![](https://www.tensorflow.org/images/linear-relationships.png)

### Let's try an example: words in a semantic space
$\rightarrow$ https://projector.tensorflow.org

### <font color='magenta'> Q11. Record any interesting findings from TensorFlow Projector page</font>

(type in your response here)

## OK. Let's explore in more detail on our local machines!
- Download the [pretrained model](https://drive.google.com/open?id=10GXpuviDJVa-k8ZmiYX3BVABNDRaA6tg) and place it in the same folder as this notebook
- We are using [gensim](https://radimrehurek.com/gensim/) package this time

In [None]:
# ! conda install -y gensim

In [None]:
import gensim

In [None]:
# from https://github.com/eyaler/word2vec-slim
w2v_mod = gensim.models.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300-SLIM.bin", binary=True)

## Calculating similarity between words

- Q: What's the similarity between *school* and *student*?

- the word vector for *school* looks like this:

In [None]:
w2v_mod['school']

In [None]:
len(w2v_mod['school'])

- and the word vector for *student* looks like this:

In [None]:
w2v_mod['student']

- the similarity between two word vectors is:

In [None]:
w2v_mod.similarity('school', 'student')

### Methods for measuring similarity

<table>
<tr>
    <td><img src="https://nickgrattan.files.wordpress.com/2014/06/screenhunter_76-jun-10-08-36.jpg" width="400"></td>
    <td><img src="https://nickgrattan.files.wordpress.com/2014/06/screenhunter_77-jun-10-08-36.jpg" width="400"></td>
    <td><img src="https://nickgrattan.files.wordpress.com/2014/06/screenhunter_77-jun-10-08-37.jpg" width="400"></td>
</tr>
</table>

- Euclidean distance
    - The most commonly used distance measure
    - $ \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} $

In [None]:
# (images from https://nickgrattan.wordpress.com/2014/06/10/euclidean-manhattan-and-cosine-distance-measures-in-c/)
np.sqrt(np.power((12-5), 2) + np.power((14-11), 2))

- Manhattan distance
    - Distance = the sum of differences in the grid
    - $|x_1 - x_2| + |y_1 - y_2|$

In [None]:
np.abs(12-5) + np.abs(14-11)
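Both formulas extend naturally from two coordinates to any number of dimensions, which is what we need for 300-dimensional word vectors. A minimal standard-library sketch (the function names `euclidean` and `manhattan` are chosen here for illustration):

```python
import math

def euclidean(p, q):
    """Square root of the summed squared differences, for vectors of any length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of the absolute differences, for vectors of any length."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Same two points as in the cells above.
point_p, point_q = (12, 14), (5, 11)
print(euclidean(point_p, point_q), manhattan(point_p, point_q))

# The same functions work unchanged in three (or 300) dimensions.
print(euclidean((1, 2, 3), (4, 6, 3)), manhattan((1, 2, 3), (4, 6, 3)))
```

The Manhattan distance is never smaller than the Euclidean distance between the same two points, since it measures the path along the grid rather than the straight line.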

- Cosine similarity 
    - Often used to measure similarity between vectors
    - $\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i }{\sqrt{\sum_{i=1}^{n} A_i^2 } \sqrt{\sum_{i=1}^{n} B_i^2 }}$ 
    - https://en.wikipedia.org/wiki/Cosine_similarity

In [None]:
a = np.array([12, 14])
b = np.array([5, 11])
a.dot(b) / (np.sqrt(np.sum(np.power(a, 2))) * np.sqrt(np.sum(np.power(b, 2))))

In [None]:
# (image from http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/)

![](http://blog.christianperone.com/wp-content/uploads/2013/09/cosinesimilarityfq1.png)

- Cosine similarity can go from -1 to 1
- But usually, we deal with 0 to 1 scores for comparing words in ```Word2Vec```
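The three landmark values of cosine similarity can be checked on small hand-made vectors. This sketch uses only the standard library; `cosine_sim` is a name chosen here so that it does not clash with other functions in the notebook.

```python
import math

def cosine_sim(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_sim([1, 2], [2, 4])     # parallel vectors: close to 1
orthogonal = cosine_sim([1, 0], [0, 3])         # perpendicular vectors: 0
opposite = cosine_sim([1, 2], [-1, -2])         # opposite vectors: close to -1

print(same_direction, orthogonal, opposite)
```

Note that `[1, 2]` and `[2, 4]` have different lengths but the same direction: cosine similarity ignores magnitude and compares direction only.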

### <font color='magenta'> Q12a. What's the cosine similarity between *school* and *tiger*? </font>
- How would you interpret the results?

(type in your response here)

In [None]:
# put your code here

### <font color='magenta'> Q12b. Try some other words. Any other interesting findings? </font>
- Give 3 more examples.
- How would you interpret the results?

In [None]:
# put your code here

(type in your response here)

## Analogy from word vectors

<img src="https://www.tensorflow.org/images/linear-relationships.png" width="800">

#### Can we approximate the relationship between words by doing - and + operations?

- $woman - man + king \approx ?$
- How does this work?
    - $woman:man \approx x:king $
    - $\rightarrow woman - man \approx x - king $
    - $\rightarrow woman - man + king \approx x$
    - List top-10 words ($x$) that can solve the equation!

In [None]:
w2v_mod.most_similar(positive=['woman', 'king'], negative=['man'])
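The arithmetic behind the analogy can be sketched with hand-made toy vectors. Everything below is hypothetical: the three axes loosely stand for (royalty, male, female), the numbers are invented, and the `analogy` helper is a simplified stand-in; gensim's own `most_similar` differs in details.

```python
import math

# Hypothetical 3-D vectors; the axes loosely stand for (royalty, male, female).
toy_vectors = {
    "king":   (1.0, 1.0, 0.0),
    "queen":  (1.0, 0.0, 1.0),
    "man":    (0.0, 1.0, 0.0),
    "woman":  (0.0, 0.0, 1.0),
    "prince": (0.8, 1.0, 0.0),
}

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, rank the remaining words."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Skip the query words themselves, otherwise they tend to dominate the ranking.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return sorted(candidates, key=lambda w: cosine_sim(vectors[w], target), reverse=True)

ranking = analogy(positive=["woman", "king"], negative=["man"], vectors=toy_vectors)
print(ranking)
```

In this toy space, `woman - man + king` lands exactly on the `queen` vector; with real embeddings the result is only approximately close, which is why we ask for the nearest words.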

- $Spain - Germany + Berlin \approx ?$
    - $\rightarrow Spain - Germany \approx x -  Berlin $

In [None]:
w2v_mod.most_similar(positive=['Spain', 'Berlin'], negative=['Germany'])

### <font color='magenta'> Q13. Any other interesting examples? </font>
- Give 3 more examples.
- How would you interpret the results?

(type in your response here)

In [None]:
# put your code here

## Constructing interpretable semantic scales 

- So far, we have seen that word vectors carry semantic information effectively (although not perfectly).
- Can we use the semantic space to produce results that are easier to interpret?

- Let's try again with real data points [here](https://projector.tensorflow.org): *politics* words in a *bad-good* PCA space

In [None]:
from scipy import spatial
 
def cosine_similarity(x, y):
    return(1 - spatial.distance.cosine(x, y))

- Can we reproduce these results with our embedding model?

### Let's plot words in the 2D space
- Using Bad & Good axes
- Calculate the cosine similarity between each word being evaluated (violence, discussion, and issues) and each end of the scale (bad and good)

In [None]:
pol_words_sim_2d = pd.DataFrame([[cosine_similarity(w2v_mod['violence'], w2v_mod['good']), cosine_similarity(w2v_mod['violence'], w2v_mod['bad'])],
                                 [cosine_similarity(w2v_mod['discussion'], w2v_mod['good']), cosine_similarity(w2v_mod['discussion'], w2v_mod['bad'])],
                                 [cosine_similarity(w2v_mod['issues'], w2v_mod['good']), cosine_similarity(w2v_mod['issues'], w2v_mod['bad'])]],
                                index=['violence', 'discussion', 'issues'], columns=['good', 'bad'])

In [None]:
pol_words_sim_2d

- If we plot this:

In [None]:
sns.scatterplot(x='good', y='bad', data=pol_words_sim_2d, hue=pol_words_sim_2d.index)

- violence: less good, more bad
- discussion: less bad, more good
- issues: both bad and good

### Can we do this on a 1D scale?
(bad) --------------------?---- (good)

- First, let's create the vector for the *bad-good* scale

In [None]:
scale_bad_good = w2v_mod['good'] - w2v_mod['bad']

- Calculate the cosine similarity score of the word *violence* on the *bad-good* scale
    - $sim(V(violence), V(good) - V(bad))$

In [None]:
cosine_similarity(w2v_mod['violence'], scale_bad_good)

- $sim(V(discussion), V(good) - V(bad))$

In [None]:
cosine_similarity(w2v_mod['discussion'], w2v_mod['good'] - w2v_mod['bad'])

- $sim(V(issues), V(good) - V(bad))$

In [None]:
cosine_similarity(w2v_mod['issues'], w2v_mod['good'] - w2v_mod['bad'])
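The direction of the subtraction determines the sign of the score: with a scale vector of *good* minus *bad*, positive scores lean towards *good* and negative scores towards *bad*. The toy sketch below uses invented 2-D vectors (not values from the real model) to make that visible.

```python
import math

# Hypothetical 2-D vectors, invented for illustration only.
toy = {
    "good":       (1.0, 0.2),
    "bad":        (-1.0, 0.2),
    "violence":   (-0.8, 0.5),
    "discussion": (0.6, 0.7),
}

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Scale vector pointing from 'bad' towards 'good'.
scale = tuple(g - b for g, b in zip(toy["good"], toy["bad"]))

scores = {word: cosine_sim(toy[word], scale) for word in toy}
for word, score in scores.items():
    print(f"{word:>10}: {score:+.3f}")
```

Reversing the subtraction (*bad* minus *good*) would flip every sign without changing the ordering of the words along the scale.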

In summary, as displayed in the Embedding Projector, the words *violence*, *discussion*, and *issues* are located on the *bad-good* scale as follows:

In [None]:
pol_words_sim = pd.DataFrame([cosine_similarity(w2v_mod['violence'], w2v_mod['good'] - w2v_mod['bad']),
                              cosine_similarity(w2v_mod['discussion'], w2v_mod['good'] - w2v_mod['bad']),
                              cosine_similarity(w2v_mod['issues'], w2v_mod['good'] - w2v_mod['bad'])],
                             index=['violence', 'discussion', 'issues'], columns=['cos_sim'])

In [None]:
pol_words_sim

In [None]:
ax = sns.barplot(x=pol_words_sim.index, y=pol_words_sim.cos_sim)
ax.set(ylabel="bad_good scale")
plt.show()

- *Violence* is closer to the *bad* end of the scale, while *discussion* is located towards the *good* end. *Issues* falls between those two words on the *bad-good* scale.

### <font color='magenta'> Q14. Select a different scale and a set of words. How are the words represented on your suggested semantic scale? </font>
- Why did you select that particular scale and those words? What's your interpretation?

(type in your response here)

In [None]:
# put your code here

- more to read about this method:     http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html