# **Cosine Similarity**

Cosine similarity is often used to assess how similar documents are to one another, irrespective of their size. Where Euclidean distance measures the magnitude of the separation between two vectors, cosine similarity measures the angle between two multidimensional vectors. It is closely related to correlation: the Pearson correlation is the [cosine similarity](https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/) between centered versions of x and y, and both are bounded between -1 and 1.
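
The link between correlation and cosine similarity can be checked numerically. The sketch below is illustrative only: the helper `cosine` and the example vectors are assumptions made for this demonstration, not part of the original notebook. It centres each vector by subtracting its mean and compares the resulting cosine with NumPy's `np.corrcoef`.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example data chosen only for illustration
x = np.array([2.0, 3.0, 1.0, 0.0])
y = np.array([1.0, 4.0, 0.0, 2.0])

# Centre each vector by subtracting its own mean
xc = x - x.mean()
yc = y - y.mean()

centered_cosine = cosine(xc, yc)
pearson_r = float(np.corrcoef(x, y)[0, 1])

# The two values should agree up to floating-point error
print(centered_cosine, pearson_r)
```

Note that centring requires a vector with some variation: a constant vector becomes the zero vector once centred, and the cosine is then undefined.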

The maths behind this measure are derived from the Euclidean dot product:

<br>

<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/fb9fc371e46e02d0ef51e781e7397629425856b5" class="mwe-math-fallback-image-inline" aria-hidden="true" style="vertical-align: -0.838ex; width:22.631ex; height:2.843ex;" alt="{\displaystyle \mathbf {A} \cdot \mathbf {B} =\left\|\mathbf {A} \right\|\left\|\mathbf {B} \right\|\cos \theta }">

Rearranging for $\cos\theta$ gives the similarity:

<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/1d94e5903f7936d3c131e040ef2c51b473dd071d" class="mwe-math-fallback-image-inline" aria-hidden="true" style="vertical-align: -7.338ex; width:52.215ex; height:14.343ex;" alt="{\displaystyle {\text{similarity}}=\cos(\theta )={\mathbf {A} \cdot \mathbf {B}  \over \|\mathbf {A} \|\|\mathbf {B} \|}={\frac {\sum \limits _{i=1}^{n}{A_{i}B_{i}}}{{\sqrt {\sum \limits _{i=1}^{n}{A_{i}^{2}}}}{\sqrt {\sum \limits _{i=1}^{n}{B_{i}^{2}}}}}},}">
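
As a rough sketch of how the formula translates to code, the function below computes each sum explicitly with NumPy. The name `cosine_similarity_1d` is hypothetical (it is not a scikit-learn function), and the zero-vector check is an added assumption, since the formula is undefined when either norm is zero. The example vectors are the same ones used in the scikit-learn cell later in this section.

```python
import numpy as np

def cosine_similarity_1d(a, b):
    """Cosine similarity of two 1-D vectors, written term by term."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dot = np.sum(a * b)               # numerator: sum of A_i * B_i
    norm_a = np.sqrt(np.sum(a ** 2))  # ||A||
    norm_b = np.sqrt(np.sum(b ** 2))  # ||B||
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(dot / (norm_a * norm_b))

a = [2, 3, 1, 0]
b = [1, 1, 1, 1]
print(cosine_similarity_1d(a, b))  # 6 / (sqrt(14) * 2), roughly 0.8018
```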

Cosine similarity is generally used when the magnitude of the vectors does not matter. The term $\|\vec{A}\|\cos{\theta}$ is the length of the projection of $\vec{A}$ onto $\vec{B}$ (Figure 1).

![alt text](https://www.computing.dcu.ie/~amccarren/mcm_images/Dot_Product.png)

Figure 1: the projection of $\vec{A}$ on $\vec{B}$ (source: Wikipedia)
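
The projection in Figure 1 can be reproduced with a small numerical example. This is a minimal sketch with two-dimensional vectors invented for illustration: it computes the scalar projection $\|\vec{A}\|\cos\theta$ in two equivalent ways, once via the angle and once directly from the dot product.

```python
import numpy as np

# Illustrative vectors (assumed values, not taken from the figure)
A = np.array([3.0, 4.0])
B = np.array([2.0, 0.0])

# Cosine of the angle between A and B
cos_theta = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Scalar projection of A on B: ||A|| * cos(theta)
scalar_projection = np.linalg.norm(A) * cos_theta

# Equivalent form that avoids computing the angle: (A . B) / ||B||
same_value = np.dot(A, B) / np.linalg.norm(B)

print(cos_theta, scalar_projection, same_value)
```

Here $\vec{B}$ lies along the first axis, so the projection is simply the first component of $\vec{A}$.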

<br>

The following code, adapted from GitHub, uses scikit-learn's `cosine_similarity`. Notice how two vectors pointing in the same direction but with different magnitudes have the same cosine similarity with $\vec{x}$ and $\vec{y}$:

$$\vec{z} = [1~ 1~ 1~ 1]$$

$$\vec{z_2} = [100 ~ 100~ 100~ 100]$$

but when we compute the Euclidean distance, there is a vast difference.

Again, play with these measures. Would you expect z and z2 to be correlated with each other? Place your thoughts on the comments board.







In [0]:
# http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# A 1-D array has the wrong shape: cosine_similarity expects 2-D input of shape (n_samples, n_features)
x = np.array([2,3,1,0])
y = np.array([2,3,0,0])

# Need to reshape these
x = x.reshape(1,-1)
y = y.reshape(1,-1)

# Or just create as a single row matrix
z = np.array([[1,1,1,1]])
z2 = np.array([[100,100,100,100]])


# Now we can compute similarities
print(cosine_similarity(x,z)) 
print(cosine_similarity(x,z2)) 
print(cosine_similarity(y,z)) 
print(cosine_similarity(y,z2)) 


[[0.80178373]]
[[0.80178373]]
[[0.69337525]]
[[0.69337525]]


In [0]:
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Build X directly as a single-row matrix, matching the shapes of z and z2
X = np.array([[2, 3, 1, 0]])
z = np.array([[1, 1, 1, 1]])
z2 = np.array([[100, 100, 100, 100]])

# Euclidean distance from the row of X to z, then to z2
print(euclidean_distances(X, z))
print(euclidean_distances(X, z2))

[[2.44948974]]
[[197.01268995]]
