<h1><center> PPOLS564: Foundations of Data Science </center></h1>
<h3><center> Lecture 11 <br><br><font color='grey'> Trigonometry of Vectors </font></center></h3>

# Concepts For today:

- Vector Dot Product
- Orthogonality
- Projection
- Normalizing vectors
- Example: Comparing text documents

## Note
In the following lectures, we'll delve into linear algebra. Note that I'll be using some code to help generate interactive visualizations for some concepts. To use this code yourself, two things must be true: (1) the `bokeh` module must be installed, and (2) the `visualize.py` script must be in the same directory as this notebook, and the Jupyter notebook must be launched from that location.

Finally, note that these lecture slides are intended to be supplementary to the lectures and readings.

In [1]:
import numpy as np
from visualize import LinearAlgebra as vla

# Vector Multiplication (Vector Dot Product)

Given $\vec{a}, \vec{b} \in \Re^n$



$$ \vec{a} \cdot \vec{b} $$

$$ \begin{bmatrix}  a_1 \\ a_2 \\  \vdots \\ a_n  \end{bmatrix} \cdot 
\begin{bmatrix}  b_1 \\ b_2 \\  \vdots \\ b_n  \end{bmatrix}$$

$$ a_1 b_1 + a_2 b_2 + \dots + a_n b_n $$

$$ \vec{a} \cdot \vec{b} = \sum_{i=1}^n a_i b_i$$

The dot product between two column vectors produces a scalar ($c$).

In [34]:
# Computationally 
a = np.array([1,2])
b = np.array([2,1])

# Three ways to take the dot product using numpy
print(a.dot(b))

# or 

print(np.dot(a,b))

# or

print(a @ b)

4
4
4
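To connect the summation formula to the NumPy calls above, here is a minimal sketch that computes the dot product by explicit summation. The helper name `dot_by_hand` is made up for illustration and is not part of NumPy.

```python
import numpy as np

def dot_by_hand(a, b):
    """Sum of element-wise products: a_1*b_1 + ... + a_n*b_n."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

a = np.array([1, 2])
b = np.array([2, 1])

# 1*2 + 2*1, which should agree with np.dot
print(dot_by_hand(a, b))
print(dot_by_hand(a, b) == np.dot(a, b))
```

In practice we'd stick with `np.dot` or `@`, which are much faster on large arrays; the loop is only there to mirror the formula.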


## Properties 

|Property| Expression|
|-------------|---------------|
| **Commutative** | $\vec{a} \cdot \vec{b} = \vec{b} \cdot \vec{a} $|
| **Distributive** | $ \vec{v} \cdot (\vec{a} + \vec{b}) = \vec{v} \cdot \vec{a} + \vec{v} \cdot \vec{b} $|
| **Associative** (with scalar multiplication) | $ (c\vec{a}) \cdot \vec{b} = c(\vec{a} \cdot \vec{b}) $|
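A quick numerical spot check of the three properties can help make them concrete. This is only a sketch with arbitrarily chosen vectors (`v` and the scalar `c = 5` are assumptions for illustration); checking one example is not a proof.

```python
import numpy as np

a = np.array([1, 2])
b = np.array([2, 1])
v = np.array([3, -1])
c = 5

# Order does not matter: a . b == b . a
commutative = np.dot(a, b) == np.dot(b, a)

# Dot product distributes over addition: v . (a + b) == v . a + v . b
distributive = np.dot(v, a + b) == np.dot(v, a) + np.dot(v, b)

# A scalar can be pulled out: (c a) . b == c (a . b)
scalar = np.dot(c * a, b) == c * np.dot(a, b)

print(commutative, distributive, scalar)
```

Integer vectors are used so that the equality checks are exact; with floating point values, `np.isclose` would be the safer comparison.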

# Magnitude (Length) of a Vector 

What is the length of $\vec{c}$?

In [3]:
# Vector c
c = np.array([1,2])

# Plot the vector
plot = vla()
plot.graph()
plot.vector(c)
plot.show()

Now recall our discussion of unit vectors:

In [4]:
i = np.array([1,0])
j = np.array([0,1])

(1*i) + (2*j) 

array([1, 2])

In [5]:
# Create our scaled unit vectors
a = i
b = 2*j

plot.vector(a)
plot.change_origin(i)
plot.vector(b)
plot.show()

Recall the Pythagorean Theorem

$$ a^2 + b^2 = c^2 $$

$$ \left\| a \right\|^2 + \left\| b \right\|^2 = \left\| c \right\|^2 $$

In [6]:
a.dot(a) + b.dot(b)

5



$$ \begin{bmatrix}  c_1 \\ c_2 \\  \vdots \\ c_n  \end{bmatrix} \cdot 
\begin{bmatrix}  c_1 \\ c_2 \\  \vdots \\ c_n  \end{bmatrix}$$

$$ c_1 c_1 + c_2 c_2 + \dots + c_n c_n $$ 

$$ c_1^2 + c_2^2 + \dots + c_n^2 =  \left\| c \right\|^2 $$ 

$$ \sqrt{\vec{c} \cdot \vec{c}} = \left\| c \right\|  $$ 

For example,

$$ \vec{c} = \begin{bmatrix}  1 \\ 2 \end{bmatrix} $$

$$ \begin{bmatrix}  1 \\ 2 \end{bmatrix} \cdot 
\begin{bmatrix}  1 \\ 2 \\ \end{bmatrix}$$

$$ 1(1) + 2(2) $$ 

$$ 1 + 4 $$ 

$$ \left\| c \right\|^2 = 5$$ 

$$ \sqrt{\left\| c \right\|^2} = \sqrt{5}$$ 

$$ \left\| c \right\| \approx 2.24 $$

In [7]:
np.sqrt(c.dot(c))

2.23606797749979

In [8]:
np.linalg.norm(c)

2.23606797749979

# Angles between Vectors

In [9]:
a = np.array([4,1])
b = np.array([1,2])

plot.clear().graph(7)
plot.vector(a)
plot.vector(b)
plot.show()

### Law of Cosines

$$ c^2 = a^2 + b^2 - 2ab\cos{\theta} $$

In [10]:
# Let's subtract the two vectors to get the third side of the triangle (a - b)
plot.subtract_vectors(a,b)
plot.show()

$$  \left\|  a - b \right\|^2 = \left\|  a \right\|^2 + \left\| b  \right\|^2 - 2\left\| a \right\| \left\| b \right\|\cos{\theta}  $$

$$ (\vec{a} - \vec{b}) \cdot (\vec{a} - \vec{b}) = \left\|  a \right\|^2 + \left\| b  \right\|^2 - 2\left\| a \right\| \left\| b \right\|\cos{\theta}  $$

$$ \vec{a} \cdot \vec{a} - 2(\vec{a} \cdot \vec{b}) + \vec{b} \cdot \vec{b} = \left\|  a \right\|^2 + \left\| b  \right\|^2 - 2\left\| a \right\| \left\| b \right\|\cos{\theta}  $$

$$  \left\| a \right\|^2 - 2(\vec{a} \cdot \vec{b}) + \left\| b \right\|^2 = \left\|  a \right\|^2 + \left\| b  \right\|^2 - 2\left\| a \right\| \left\| b \right\|\cos{\theta}  $$

$$ \vec{a} \cdot \vec{b} =  \left\| a \right\| \left\| b \right\|\cos{\theta} $$ 

#### In words, the dot product of two vectors is equal to the product of their lengths times the cosine of the angle between them. 
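Rearranging this identity lets us recover the angle itself. The sketch below does so for the two vectors plotted earlier in this section; the use of `np.clip` is my own precaution against floating point overshoot, not something the derivation requires.

```python
import numpy as np

a = np.array([4, 1])
b = np.array([1, 2])

# cos(theta) = (a . b) / (||a|| ||b||)
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Clip guards against values drifting slightly outside [-1, 1]
theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))

print(round(cos_theta, 3))           # cosine of the angle
print(round(np.degrees(theta), 1))   # angle in degrees
```

For these vectors the angle comes out to roughly 49.4 degrees, which matches the difference between the angles each vector makes with the horizontal axis.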

## Triangle Inequality

An important rule to keep in mind: the sum of the lengths of any two sides of a triangle must always be greater than or equal to the length of the third side. In linear algebra, we take this property from trigonometry and apply it to N-dimensional space.

$$ \left\| \vec{a} + \vec{b} \right\| \le \left\| \vec{a} \right\| + \left\|\vec{b} \right\| $$
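As a small sanity check, we can evaluate both sides of the inequality for a pair of example vectors. The vectors here are arbitrary choices, and `d` is an illustrative name for a rescaled copy of `a` that shows the equality case.

```python
import numpy as np

a = np.array([4, 1])
b = np.array([1, 2])

lhs = np.linalg.norm(a + b)                   # ||a + b||
rhs = np.linalg.norm(a) + np.linalg.norm(b)   # ||a|| + ||b||
print(lhs <= rhs)

# Equality holds when the two vectors point in the same direction
d = 3 * a
print(np.isclose(np.linalg.norm(a + d), np.linalg.norm(a) + np.linalg.norm(d)))
```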

## Orthogonal Vectors 

When the angle between two vectors is 90 degrees (i.e. when the vectors are **perpendicular** to one another), $\cos{\theta} = 0$.

In [11]:
a = np.array([4,0])
b = np.array([0,5])

plot.clear().graph(10)
plot.vector(a)
plot.vector(b)
plot.show()

$$ \vec{a} \cdot \vec{b} =  \left\| a \right\| \left\| b \right\|\cos{90^\circ}$$ 

$$ \vec{a} \cdot \vec{b} =  \left\| a \right\| \left\| b \right\| \cdot 0$$ 

$$ \vec{a} \cdot \vec{b} =  0 $$ 

In [12]:
np.dot(a,b)

0

Thus, when we take the dot product of two vectors and the result is 0, we know that the two vectors are orthogonal to one another.
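With floating point vectors the dot product is rarely exactly 0, so a tolerance is usually needed. Here is one possible helper; the name `is_orthogonal` and the default tolerance are assumptions for illustration, and a relative tolerance may be more appropriate for very large or very small vectors.

```python
import numpy as np

def is_orthogonal(a, b, tol=1e-10):
    """Return True when the dot product is zero up to a small tolerance."""
    return bool(abs(np.dot(a, b)) < tol)

print(is_orthogonal(np.array([4, 0]), np.array([0, 5])))    # axis-aligned vectors
print(is_orthogonal(np.array([1, 1]), np.array([1, -1])))   # perpendicular, but not on the axes
print(is_orthogonal(np.array([4, 1]), np.array([1, 5])))    # dot product is 9
```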

In [13]:
a = np.array([4,1])
b = np.array([1,5])

plot.clear().graph(10)
plot.vector(a)
plot.vector(b)
plot.show()

In [14]:
np.dot(a,b)

9

### Calculating the cosine

$$ \vec{a} \cdot \vec{b} =  \left\| a \right\| \left\| b \right\|\cos{\theta}$$

$$ \cos{\theta} =  \frac{\vec{a} \cdot \vec{b}}{\left\| a \right\| \left\| b \right\|}$$

In [15]:
def cosine(a,b):
    cos = np.dot(a,b)/(np.sqrt(np.dot(a,a)) * np.sqrt(np.dot(b,b))  )
    return cos

In [16]:
round(cosine(a,b),3)

0.428

In [17]:
# Let's reverse engineer this to get the dot product again.
np.sqrt(a.dot(a)) * np.sqrt(b.dot(b)) * cosine(a,b)

9.0

# Dot product as a projection

We can think of the dot product of two vectors as a measure of how much the two vectors **move in the same direction**.

Imagine we cast a vector down from the tip of $\vec{a}$ onto $\vec{b}$ such that this vector (which we'll call $\vec{v}$) is orthogonal to $\vec{b}$ (i.e. they meet at 90 degrees). 

In [18]:
plot.clear().graph(8)
plot.projection(a,b)
plot.show()

What is the size of the "shadow" that $\vec{a}$ casts onto $\vec{b}$?

$$ \vec{v} = \vec{a} - c\vec{b} $$

by design, $\vec{b} \cdot \vec{v} = 0$


$$ (\vec{a}-c\vec{b}) \cdot \vec{b}  = 0 $$ 

$$ \vec{a} \cdot \vec{b} -c\vec{b} \cdot \vec{b}  = 0 $$ 

$$ \vec{a} \cdot \vec{b} = c\vec{b} \cdot \vec{b}  $$ 

$$ \frac{\vec{a} \cdot \vec{b}}{\vec{b} \cdot \vec{b}} = c  $$ 

$$ c = \frac{\vec{a} \cdot \vec{b}}{\left\| b \right\|^2} $$ 


Thus, our "shadow" vector is merely a scaled version of $\vec{b}$

$$ \text{shadow} = c\vec{b}$$

$\vec{a}$ is moving in the direction of $\vec{b}$ by  $c\vec{b}$




**Applied**: What is the size of the projection of $\vec{a}$ onto $\vec{b}$ ?

In [33]:
c = np.dot(a,b)/np.dot(b,b)
projection_vector = c*b

projection_vector

array([0.34615385, 1.73076923])
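We can check the construction numerically: subtracting the projection from $\vec{a}$ should leave a vector that is orthogonal to $\vec{b}$. The sketch below redefines the same `a` and `b` so it runs on its own; the names `shadow` and `v` follow the derivation above.

```python
import numpy as np

a = np.array([4, 1])
b = np.array([1, 5])

c = np.dot(a, b) / np.dot(b, b)   # scaling coefficient
shadow = c * b                    # projection of a onto b
v = a - shadow                    # the part of a that is perpendicular to b

# v . b should be zero up to floating point error
print(np.isclose(np.dot(v, b), 0.0))

# The length of the shadow equals |a . b| / ||b||
print(np.isclose(np.linalg.norm(shadow), abs(np.dot(a, b)) / np.linalg.norm(b)))
```

Note that `shadow + v` reassembles `a`, so the projection splits $\vec{a}$ into a piece along $\vec{b}$ and a piece perpendicular to it.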

### What is this really?
The projection coefficient $c$ equals the cosine of the angle between the vectors once we normalize them, that is, once we rescale each vector so that its length equals 1. This places the tips of the vectors on the unit circle.

To **normalize** a vector, we divide the vector by its length.

$$ \vec{a}_{norm} = \frac{1}{\left\| a \right\|} \vec{a} $$

where 

$$ \left\| \vec{a}_{norm} \right\| = 1 $$

In [19]:
a_norm = 1/np.sqrt(np.dot(a,a))*a
b_norm = 1/np.sqrt(np.dot(b,b))*b
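The same step can be wrapped in a small reusable function. This is a sketch rather than a library routine: the name `normalize` is hypothetical, and raising an error for the zero vector is one design choice among several (the zero vector has no direction, so it cannot be rescaled to length 1).

```python
import numpy as np

def normalize(v):
    """Scale v to unit length; the zero vector has no direction to keep."""
    length = np.linalg.norm(v)
    if length == 0:
        raise ValueError("cannot normalize the zero vector")
    return v / length

a = np.array([4, 1])
a_unit = normalize(a)

# The normalized vector should have length 1 (up to floating point error)
print(np.isclose(np.linalg.norm(a_unit), 1.0))
```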

In [20]:
plot.clear().graph(2)
plot.projection(a_norm,b_norm)
plot.show()

In [21]:
# Both vectors now have length 1, so the denominator equals 1 and c reduces to the dot product
c = np.dot(a_norm,b_norm)/np.dot(a_norm,a_norm)
c

0.4280863447390447

In [22]:
cosine(a,b)

0.4280863447390447

# Applied Example: How similar are these two statements?

In [172]:
import pandas as pd
from collections import Counter

In [173]:
descrip1 = "This is a speech given by current President Trump about Trump."
descrip2 = "This is a speech given by former President Obama about Trump."

In [174]:
def tokenize(text=None):
    text = text.lower()
    text = text.replace('.','')
    text_list = text.split()
    return text_list

In [175]:
tokenize(descrip1)

['this',
 'is',
 'a',
 'speech',
 'given',
 'by',
 'current',
 'president',
 'trump',
 'about',
 'trump']

In [176]:
d = Counter(tokenize(descrip1))
for key in d:
    d[key] = [d[key]]
d

Counter({'this': [1],
         'is': [1],
         'a': [1],
         'speech': [1],
         'given': [1],
         'by': [1],
         'current': [1],
         'president': [1],
         'trump': [2],
         'about': [1]})

In [177]:
DTM = pd.DataFrame(d)
DTM

| | this | is | a | speech | given | by | current | president | trump | about |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 |


In [178]:
def convert_tokens_to_entry(tokens):
    '''
    Converts tokens into count entries for a document term matrix.
    '''
    d = {key:[value] for key,value in Counter(tokens).items()}
    return pd.DataFrame(d)

d = tokenize(descrip1)
convert_tokens_to_entry(d)

| | this | is | a | speech | given | by | current | president | trump | about |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 |


In [179]:
# Now build a function that does this for a list of texts
def gen_DTM(texts=None):
    '''
    Generate a document term matrix
    '''
    entries = []
    for text in texts:
        tokens = tokenize(text)
        entries.append(convert_tokens_to_entry(tokens))

    # Row bind all entries at once (DataFrame.append was removed in pandas 2.0)
    DTM = pd.concat(entries, ignore_index=True, sort=True)

    # Fill in any missing values with 0s (i.e. when a word is in one text but not another)
    DTM = DTM.fillna(0)
    return DTM

# Test it out!        
gen_DTM([descrip1,descrip2]) 

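| | a | about | by | current | former | given | is | obama | president | speech | this | trump |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1.0 | 0.0 | 1 | 1 | 0.0 | 1 | 1 | 1 | 2 |
| 1 | 1 | 1 | 1 | 0.0 | 1.0 | 1 | 1 | 1.0 | 1 | 1 | 1 | 1 |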


How similar are these two statements?

In [180]:
D = gen_DTM([descrip1,descrip2]) 

# We can index the pandas dataframe to draw out a numpy array ( a vector! )
D.iloc[0].values

array([1., 1., 1., 1., 0., 1., 1., 0., 1., 1., 1., 2.])

In [181]:
a = D.iloc[0].values
b = D.iloc[1].values

Let's use cosine similarity to understand the relationship between these two statements.

In [182]:
cosine(a,b)

0.8362420100070909

Pretty similar!
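To see where that number comes from, we can redo the calculation directly on the word counts without building a data frame. The sketch below is an alternative route to the same quantity, not the approach used above; the function name `cosine_from_counts` and the pre-cleaned strings `s1` and `s2` are introduced only for illustration.

```python
import math
from collections import Counter

def cosine_from_counts(c1, c2):
    """Cosine similarity computed directly from two word-count mappings."""
    # Words missing from either document contribute 0 to the dot product
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(n * n for n in c1.values()))
    norm2 = math.sqrt(sum(n * n for n in c2.values()))
    return dot / (norm1 * norm2)

# The two descriptions, already lowercased and with the period removed
s1 = "this is a speech given by current president trump about trump"
s2 = "this is a speech given by former president obama about trump"

similarity = cosine_from_counts(Counter(s1.split()), Counter(s2.split()))
print(round(similarity, 3))
```

The nine shared words give a dot product of 10 (eight words counted once in each statement, plus "trump" with counts 2 and 1), and the squared lengths are 13 and 11, so the similarity is $10/\sqrt{143} \approx 0.836$.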

Now, to really drive home the intuition, how similar are these two statements?

In [183]:
docs = [
    "On Saturday, Samantha likes to go shopping at the mall.",
    "The results show that the marginal effect of x on y was trivial and overstated by the original authors."
]

In [184]:
D = gen_DTM(docs)
D # note how pandas (and numpy) use ellipses for abbreviation.

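| | and | at | authors | by | effect | go | likes | mall | marginal | of | ... | saturday, | shopping | show | that | the | to | trivial | was | x | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 3 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |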


In [185]:
a = D.iloc[0].values
b = D.iloc[1].values
cosine(a,b) 

0.25298221281347033

Much less alike! Actually, the only real similarity between these two statements is a couple of common function words ("on" and "the"). If we were to clean those out, we'd find that these two statements have very little in common. Let's do that!

**Removing Stopwords**

In [186]:
stopwords = pd.read_csv('stop_words.csv')

In [187]:
stopwords.head()

| | word | lexicon |
|---|---|---|
| 0 | a | SMART |
| 1 | a's | SMART |
| 2 | able | SMART |
| 3 | about | SMART |
| 4 | above | SMART |


In [188]:
# convert to a list
sw_list = list(stopwords['word'].values)
sw_list[1:5]

["a's", 'able', 'about', 'above']

In [189]:
# Rewrite our token function to clean out these words
def tokenize(text=None):
    text = text.lower()
    text = text.replace('.','')
    text_list = text.split()
    text_list2 = [word for word in text_list if word not in sw_list]
    return text_list2

print(tokenize(docs[0]))
print(tokenize(docs[1]))

['saturday,', 'samantha', 'likes', 'shopping', 'mall']
['results', 'marginal', 'effect', 'trivial', 'overstated', 'original', 'authors']
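Notice that `'saturday,'` keeps its trailing comma, because the tokenizer only removes periods. One possible refinement, sketched here under the assumption that all ASCII punctuation can be discarded, uses `str.translate` with `string.punctuation`. The function name `tokenize_strip_punct` and the tiny inline stopword set are illustrative stand-ins for `tokenize` and `sw_list`.

```python
import string

# Illustrative stand-in for the stopword list loaded from stop_words.csv
example_stopwords = {"on", "to", "go", "at", "the"}

def tokenize_strip_punct(text, stopwords=example_stopwords):
    """Lowercase, drop ASCII punctuation, split on whitespace, remove stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in stopwords]

print(tokenize_strip_punct("On Saturday, Samantha likes to go shopping at the mall."))
```

One caveat with this choice: removing every punctuation mark also removes apostrophes, so a word like "a's" would become "as" and would no longer match a stopword entry that contains the apostrophe.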


In [190]:
D = gen_DTM(docs)
a = D.iloc[0].values
b = D.iloc[1].values
cosine(a,b) 

0.0

The two statements are **completely orthogonal**! They go in completely different directions, substantively speaking. 

Given this conceptualization, we could think of any document in this way! Our knowledge of vectors helps us make substantive comparisons between unstructured text. Pretty neat!
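The same idea extends from a pair of documents to a whole collection: if we normalize every row of a document term matrix, then multiplying the matrix by its own transpose yields every pairwise cosine at once. The sketch below uses a small made-up count matrix `X` (not derived from the texts above) purely to illustrate the mechanics.

```python
import numpy as np

# Hypothetical document term matrix: one row per document, one column per term
X = np.array([
    [1.0, 1.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 3.0, 0.0],
])

# Normalize each row to unit length, then take all pairwise dot products
row_lengths = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / row_lengths
similarity = X_unit @ X_unit.T

print(np.round(similarity, 3))
```

Entry `[i, j]` of the result is the cosine between documents `i` and `j`, so the diagonal is all 1s and the matrix is symmetric. A row of all zeros (an empty document) would cause a division by zero here and would need separate handling.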