# Vector space model



Vector spaces are fundamental to many NLP applications. Whenever you represent a word, document, tweet, or any other piece of text, you will most likely encode it as a vector. These vectors are used in tasks such as information extraction, machine translation, and chatbots. Vector spaces can also help you identify relationships between words.


## Euclidean distance 

$$d(\vec{v},\vec{w}) = \sqrt{\sum_{i=1}^n (v_i - w_i)^2}$$

where:
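- $d(\vec{v},\vec{w})$ represents the Euclidean distance between vectors $\vec{v}$ and $\vec{w}$
- $v_i$ and $w_i$ are the $i$-th components of $\vec{v}$ and $\vec{w}$, and $n$ is the number of dimensions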
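As a quick check of the formula, here is a minimal sketch in NumPy. The helper name `euclidean_distance` is an illustrative choice, and the two vectors are the 3-D examples used in the code further down in these notes.

```python
import numpy as np


def euclidean_distance(v, w):
    """Square root of the summed squared element-wise differences."""
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.sqrt(np.sum((v - w) ** 2))


v = np.array([1, 0, -1])
w = np.array([2, 8, 1])

d = euclidean_distance(v, w)
# The norm of the difference vector computes the same quantity.
d_norm = np.linalg.norm(v - w)
print(d, d_norm)
```

Both values should agree, since the Euclidean distance is simply the norm of $\vec{v} - \vec{w}$.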

Issues with Euclidean distance:
Euclidean distance depends on the magnitude of the vectors, which is not always the kind of similarity you are looking for. For example, when comparing a large document with a smaller one, the word counts in the large document are much bigger, so the two can end up far apart even when they cover the same topic. Look at the diagram below:


<div align="center">
  <img src="images/issue_with_euclidian_distance.png" alt="Issue with Euclidean distance" width="500" height="200" />
</div>
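To make the issue concrete, here is a minimal sketch with made-up word counts. The three vectors are hypothetical and are not taken from the diagram. A short document and a ten times longer document with the same word proportions end up far apart, while a document with different proportions but similar length ends up close.

```python
import numpy as np

# Hypothetical counts of the same two words in three documents.
short_doc = np.array([2.0, 1.0])
long_doc = 10 * short_doc          # same proportions, ten times longer
other_doc = np.array([1.0, 3.0])   # different proportions, similar length

d_same_topic = np.linalg.norm(short_doc - long_doc)
d_other_topic = np.linalg.norm(short_doc - other_doc)

# The longer document with identical proportions is the farther one.
print(d_same_topic, d_other_topic)
```

Under these assumed counts, Euclidean distance ranks the unrelated document as the closer match, purely because of document length.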

## Cosine Similarity
Before getting into the cosine similarity function, recall that the norm of a vector is defined as:

$$\|\vec{v}\| = \sqrt{\sum_{i=1}^n v_i^2}$$

The cosine similarity function is defined as:

$$\cos(\theta) = \frac{\vec{v} \cdot \vec{w}}{\|\vec{v}\| \|\vec{w}\|}$$

where:
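- $\theta$ is the angle between vectors $\vec{v}$ and $\vec{w}$
- because $\cos(\theta)$ depends only on the angle, it is unaffected by the lengths of the vectors, which addresses the issue with Euclidean distance described above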
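The definition translates almost directly into NumPy. The sketch below is illustrative: the helper name `cosine_similarity` is an assumption, and it assumes neither vector is the zero vector (otherwise the denominator is zero).

```python
import numpy as np


def cosine_similarity(v, w):
    """Dot product divided by the product of the two norms."""
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))


v = np.array([1, 0, -1])
w = np.array([2, 8, 1])

sim = cosine_similarity(v, w)
sim_scaled = cosine_similarity(v, 10 * w)   # rescaling w leaves the angle unchanged
sim_opposite = cosine_similarity(v, -v)     # opposite directions give -1
print(sim, sim_scaled, sim_opposite)
```

Scaling one vector changes its norm and the dot product by the same factor, so the ratio stays the same. That is why the score does not depend on document length.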


## PCA (Principal Component Analysis)

Principal component analysis (PCA) is an unsupervised learning algorithm that can be used to reduce the dimension of your data, which in turn lets you visualize it. It finds new, uncorrelated features that retain as much of the variance of the original features as possible.

For example, after applying PCA to word vectors, you will see that oil & gas are close to one another, and town & city are also close to one another. To plot the data, you can use PCA to go from d > 2 dimensions to d = 2.

Eigenvector: the resulting vectors, which give the directions of the uncorrelated features of your data.

Eigenvalue: the amount of information retained by each new feature. You can think of it as the variance of the data along the corresponding eigenvector.
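As a small illustration, the sketch below decomposes a hypothetical 2x2 covariance matrix with `np.linalg.eigh`. The matrix values are an assumption chosen for easy arithmetic. Dividing each eigenvalue by their sum gives the share of the total variance carried by the matching eigenvector.

```python
import numpy as np

# Hypothetical symmetric covariance matrix, chosen for illustration.
cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])

# eigh is intended for symmetric matrices and returns eigenvalues in
# ascending order; column i of eigen_vecs pairs with eigen_vals[i].
eigen_vals, eigen_vecs = np.linalg.eigh(cov)

# Share of the total variance carried by each eigenvector.
explained = eigen_vals / eigen_vals.sum()

# The eigenvectors are orthonormal, so this product is the identity matrix.
gram = eigen_vecs.T @ eigen_vecs

print(eigen_vals)   # approximately [1, 3]
print(explained)    # approximately [0.25, 0.75]
```

For this matrix, the eigenvector with eigenvalue 3 carries three quarters of the variance, so keeping only that direction would retain most of the information.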

Each eigenvalue is paired with one eigenvector, so sorting the eigenvalues from largest to smallest also ranks the eigenvectors by how much variance they capture.

Steps to Compute PCA: 
  - Mean normalize your data.
  - Compute the covariance matrix.
  - Compute the SVD of your covariance matrix. This returns [U, S, V] = svd(Σ). The columns of U are the eigenvectors, and the diagonal of S holds the eigenvalues.
  - You can then use the first n columns of the matrix U to get your new data by computing XU[:, 0:n].

  The same steps in equation form:

  $$\Sigma = \frac{1}{m}X^TX$$

  $$[U,S,V] = svd(\Sigma)$$

  $$X_{new} = XU[:,0:n]$$

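  where:
  - $X$ is the mean-normalized data matrix, with one sample per row
  - $m$ is the number of samples
  - $n$ is the number of principal components to keep
  - $U$ is the matrix whose columns are the eigenvectors of $\Sigma$
  - $S$ is the diagonal matrix of eigenvalues
  - $V$ is also an eigenvector matrix (it matches $U$ because $\Sigma$ is symmetric and positive semidefinite)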
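The equations above can be sketched in NumPy as follows. This is an illustrative sketch, not a reference implementation: the function name `pca_svd` and the random test data are assumptions. Note that `np.linalg.svd` returns the singular values as a 1-D array (in descending order) rather than a diagonal matrix, and it returns the transpose of $V$.

```python
import numpy as np


def pca_svd(X, n_components=2):
    """Project X (one sample per row) onto its top principal components."""
    X = np.asarray(X, dtype=float)
    m = X.shape[0]

    # Step 1: mean normalize each feature (column).
    X_centered = X - X.mean(axis=0)

    # Step 2: covariance matrix, Sigma = (1/m) X^T X.
    sigma = (X_centered.T @ X_centered) / m

    # Step 3: SVD of the covariance matrix. For this symmetric matrix the
    # singular values in S are its eigenvalues, largest first.
    U, S, Vt = np.linalg.svd(sigma)

    # Step 4: keep the first n_components columns of U and project.
    X_new = X_centered @ U[:, :n_components]
    return X_new, S


# Hypothetical data: 50 samples with 5 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))

X_new, S = pca_svd(X, n_components=2)
print(X_new.shape)
```

With the $\frac{1}{m}$ normalization used here, the variance of each column of `X_new` equals the corresponding eigenvalue in `S`, which is a convenient sanity check.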

  





```python
import numpy as np


def eu_dist(v, w):
    """Return (Euclidean distance, cosine similarity) for vectors v and w."""
    distance = np.linalg.norm(v - w)
    cosine = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
    return distance, cosine


# 2-D example vectors
usa = np.array([5, 6])
wash = np.array([10, 5])
ag = np.array([9, 1])
X = usa - wash + ag
# Uncomment to compare X against a few candidate vectors:
# print(eu_dist(usa, wash))
# print(eu_dist(X, np.array([3, 1])))
# print(eu_dist(X, np.array([4, 3])))
# print(eu_dist(X, np.array([5, 5])))

# 3-D example vectors
v = np.array([1, 0, -1])
w = np.array([2, 8, 1])
X = np.array([3, 1, 4])
print(eu_dist(v, w))
print(eu_dist(X, v))
```

Output:

```
(8.306623862918075, 0.08512565307587484)
(5.477225575051661, -0.1386750490563073)
```


```python
import numpy as np


def compute_pca(X, n_components=2):
    """
    Input:
        X: 2D NumPy array of dimension (m, n) where each row corresponds to a word vector
        n_components: number of components you want to keep
    Output:
        X_reduced: data transformed into n_components dims/columns
    """
    # mean center the data
    X_demeaned = X - np.mean(X, axis=0)

    # calculate the covariance matrix (columns are the variables)
    covariance_matrix = np.cov(X_demeaned, rowvar=False)

    # calculate eigenvectors & eigenvalues of the covariance matrix
    eigen_vals, eigen_vecs = np.linalg.eigh(covariance_matrix)

    # sort eigenvalues in increasing order (get the indices from the sort)
    idx_sorted = np.argsort(eigen_vals)

    # reverse the order so that it's from highest to lowest
    idx_sorted_decreasing = idx_sorted[::-1]

    # sort the eigenvalues by idx_sorted_decreasing
    eigen_vals_sorted = eigen_vals[idx_sorted_decreasing]

    # sort eigenvectors (columns) using the idx_sorted_decreasing indices
    eigen_vecs_sorted = eigen_vecs[:, idx_sorted_decreasing]

    # select the first n eigenvectors (n is the desired dimension
    # of the rescaled data array, or n_components)
    eigen_vecs_subset = eigen_vecs_sorted[:, 0:n_components]

    # transform the data by projecting the de-meaned data onto the selected
    # eigenvectors; this equals (eigen_vecs_subset.T @ X_demeaned.T).T
    X_reduced = np.dot(X_demeaned, eigen_vecs_subset)

    return X_reduced
```


