# A Brief Primer on Cosine Similarity

**Sources** - https://en.wikipedia.org/wiki/Cosine_similarity and Stack Overflow

## Definition
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space: the cosine of the angle between them. The terms used in this definition are broken down below.

##### Measure of Similarity 
A real-valued function that quantifies how similar two objects are. For cosine similarity, the two objects are non-zero vectors. A similarity measure is, loosely, the inverse of a distance metric: it takes large values for similar vectors and zero or negative values for very dissimilar ones.

Example of a distance metric: Euclidean distance.
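
As a small illustration of the difference between a distance and a similarity, the sketch below compares Euclidean distance with cosine similarity for a vector and a scaled copy of it. The vectors and the helper names `euclidean_distance` and `cosine_sim` are illustrative assumptions, not taken from the sources above.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot_ab = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_ab / (norm_a * norm_b)

v = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]  # same direction, ten times the magnitude

print(euclidean_distance(v, scaled))  # large: the points are far apart
print(cosine_sim(v, scaled))          # close to 1: the directions coincide
```

The distance grows with the difference in magnitude, while the cosine similarity depends only on direction.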

##### Inner Product Space
An inner product space is a vector space equipped with an inner product. An inner product, in turn, is a generalization of the dot product. A vector space is a collection of vectors that can be added together and multiplied by scalars.


##### Application in Information Retrieval
For example, in information retrieval, each term is assigned a different dimension, and a document is characterised by a vector in which the value in each dimension is the number of times that term appears in the document.
Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter.
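
The term-count representation described above can be sketched as follows. The two toy documents, the whitespace tokenizer, and the helper names `term_vector` and `cosine_sim` are assumptions made for illustration only; a real retrieval system would use a more careful tokenizer and weighting scheme.

```python
from collections import Counter
import math

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot_ab = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_ab / (norm_a * norm_b)

doc_1 = "the cat sat on the mat"
doc_2 = "the cat lay on the rug"

# One dimension per distinct term across both documents
vocabulary = sorted(set(doc_1.split()) | set(doc_2.split()))
vec_1 = term_vector(doc_1, vocabulary)
vec_2 = term_vector(doc_2, vocabulary)

print(vocabulary)
print(vec_1, vec_2)
print(cosine_sim(vec_1, vec_2))  # approximately 0.75
```

The shared terms ("the", "cat", "on") drive the similarity up; the terms that appear in only one document contribute to the vector lengths but not to the dot product.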

##### Advantages
Low computational complexity, especially for sparse vectors: only the non-zero dimensions need to be considered.
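
A minimal sketch of why sparsity helps, assuming each vector is stored as a `{dimension: value}` dictionary that holds only its non-zero entries. The function name `sparse_cosine` and the sample data are hypothetical.

```python
import math

def sparse_cosine(a, b):
    """Cosine similarity for sparse vectors stored as {dimension: value} dicts."""
    # Iterate over the smaller dict; a dimension missing from either
    # vector contributes nothing to the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot_ab = sum(value * b[dim] for dim, value in a.items() if dim in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot_ab / (norm_a * norm_b)

doc_a = {"cosine": 3, "vector": 1}
doc_b = {"cosine": 1, "angle": 2, "matrix": 4}

print(sparse_cosine(doc_a, doc_b))
```

The work done is proportional to the number of stored entries, not to the size of the full vocabulary.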

## Formula
#### The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula:
$$\text{similarity} = \cos \theta = \dfrac{\mathbf{A} \cdot \mathbf{B}}{\left\|\mathbf{A}\right\| \left\|\mathbf{B}\right\|}$$
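
For reference, the Euclidean dot product formula mentioned in the heading, and the same similarity written out per component for $n$-dimensional vectors (this component form is what the code cells further down evaluate):

$$\mathbf{A} \cdot \mathbf{B} = \left\|\mathbf{A}\right\| \left\|\mathbf{B}\right\| \cos \theta$$

$$\cos \theta = \dfrac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$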

The resulting similarity ranges from −1 (exactly opposite) to 1 (exactly the same direction), with 0 indicating orthogonality, or decorrelation. In-between values indicate intermediate similarity or dissimilarity. For text matching, cosine similarity can be seen as a way of normalizing for document length during comparison.
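
The three landmark values can be checked with a few hand-picked vectors. This is only a sketch; the vectors and the helper name `cosine_sim` are illustrative assumptions.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot_ab = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_ab / (norm_a * norm_b)

base = [1, 2, 3]

same_direction = cosine_sim(base, [2, 4, 6])   # close to 1
opposite = cosine_sim(base, [-1, -2, -3])      # close to -1
orthogonal = cosine_sim([1, 0], [0, 1])        # 0: the dot product vanishes

print(same_direction, opposite, orthogonal)
```

Floating-point rounding means the first two results may differ from 1 and −1 in the last digits.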

##### Normalization
In the simplest sense, this means adjusting values measured on different scales to a notionally common scale, often prior to averaging. In more complicated cases, normalization might mean more sophisticated adjustments, where the intention is to bring entire probability distributions of adjusted values into alignment.

Example: In educational assessment, there may be an intention to align distributions to a normal distribution. This is a type of normalization.
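
For cosine similarity specifically, the relevant normalization is scaling each vector to unit Euclidean length; once both vectors have length 1, the dot product alone gives the cosine similarity. A minimal sketch, with the helper name `l2_normalize` and the sample vectors chosen purely for illustration:

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector so that its Euclidean length is 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

unit_a = l2_normalize([3, 4])    # roughly [0.6, 0.8]
unit_b = l2_normalize([40, 30])  # roughly [0.8, 0.6]; the magnitude is discarded

# For unit-length vectors the dot product is the cosine similarity.
similarity = sum(x * y for x, y in zip(unit_a, unit_b))
print(similarity)  # approximately 0.96
```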

###### *Note*: To display LaTeX inside a Jupyter Markdown cell, use *$$* before and after the LaTeX lines.

#  Cosine Similarity vs. Cosine Distance

### Defining Cosine Distance
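Cosine distance is the complement of cosine similarity: cosine distance = 1 − cosine similarity. It is not a proper distance metric, because it does not satisfy the triangle inequality.

The term "cosine similarity" is sometimes also used for a different, angle-based definition of similarity, more commonly called angular similarity.

The normalized angle between the vectors, the angular distance, is a formal distance metric and can be calculated from the similarity score defined above. The angular distance can then be used to compute a similarity function, the angular similarity, bounded between 0 and 1, inclusive.

When the vector elements may be positive or negative:

$$\text{angular distance} = \frac{\cos^{-1}(\text{cosine similarity})}{\pi}$$

$$\text{angular similarity} = 1 - \text{angular distance}$$

When the vector elements are always non-negative:

$$\text{angular distance} = \frac{2 \cdot \cos^{-1}(\text{cosine similarity})}{\pi}$$

$$\text{angular similarity} = 1 - \text{angular distance}$$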
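
The angle-based formulas can be sketched in code as follows. The sketch assumes the usual naming, in which the angular distance is the normalized angle (arccos of the cosine similarity divided by π, doubled for non-negative vectors) and the angular similarity is 1 minus that distance. The function names and the `non_negative` flag are hypothetical.

```python
import math

def angular_distance(cos_sim, non_negative=False):
    """Normalized angle between two vectors, in the range [0, 1]."""
    # Clamp to [-1, 1]: rounding can push the input slightly out of range,
    # in which case math.acos would raise ValueError.
    cos_sim = max(-1.0, min(1.0, cos_sim))
    scale = 2.0 if non_negative else 1.0
    return scale * math.acos(cos_sim) / math.pi

def angular_similarity(cos_sim, non_negative=False):
    """Similarity bounded between 0 and 1, derived from the angular distance."""
    return 1.0 - angular_distance(cos_sim, non_negative)

print(angular_similarity(1.0))                      # identical direction
print(angular_similarity(0.0))                      # orthogonal, signed elements
print(angular_similarity(0.0, non_negative=True))   # orthogonal, non-negative elements
print(angular_similarity(-1.0))                     # opposite direction
```

With non-negative elements the angle never exceeds 90 degrees, which is why the factor of 2 is needed to spread the result over the full 0 to 1 range.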

In [1]:
# Consider the following 2 lists
data_I = [41, 5, 187, 10]
data_II = [30, 4, 182, 15]

# Cosine Similarity using pure Python

In [2]:
def dot(A, B):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(A, B))

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors; x ** 0.5 stands in for math.sqrt(x)."""
    return dot(a, b) / ((dot(a, a) ** 0.5) * (dot(b, b) ** 0.5))

print('Pure Python: ' + str(cosine_similarity(data_I, data_II)))

Pure Python: 0.9982020836071523
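
As a check on the printed value, the arithmetic can be worked by hand, taking $\mathbf{A}$ to be `data_I` and $\mathbf{B}$ to be `data_II` (final result rounded to four decimal places):

$$\mathbf{A} \cdot \mathbf{B} = 41 \cdot 30 + 5 \cdot 4 + 187 \cdot 182 + 10 \cdot 15 = 35434$$

$$\left\|\mathbf{A}\right\|^2 = 41^2 + 5^2 + 187^2 + 10^2 = 36775, \qquad \left\|\mathbf{B}\right\|^2 = 30^2 + 4^2 + 182^2 + 15^2 = 34265$$

$$\cos \theta = \dfrac{35434}{\sqrt{36775 \cdot 34265}} \approx 0.9982$$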


# Cosine Similarity using NumPy Only

In [3]:
import numpy as np

# Namespaced calls avoid shadowing the pure-Python dot() defined above
cos_similarity_numpy = np.dot(data_I, data_II) / (np.linalg.norm(data_I) * np.linalg.norm(data_II))
print('Numpy:\nCosine similarity is ' + str(cos_similarity_numpy))

Numpy:
Cosine similarity is 0.998202083607


# Cosine Similarity using SciPy

In [4]:
from scipy import spatial

# SciPy returns the cosine distance, i.e. 1 - cosine similarity
cos_distance = spatial.distance.cosine(data_I, data_II)
print('Scipy:\nCosine distance is ' + str(cos_distance))

cos_similarity_scipy = 1 - cos_distance
print('Cosine similarity is ' + str(cos_similarity_scipy))

Scipy:
Cosine distance is 0.00179791639285
Cosine similarity is 0.998202083607


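# Cosine Similarity using scikit-learn

In [5]:
# Aliased on import so it does not shadow the pure-Python cosine_similarity defined above
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity

# scikit-learn expects 2-D inputs (one row per sample) and returns a matrix of pairwise similarities
cos_similarity_sklearn = sk_cosine_similarity([data_I], [data_II])
print('Sklearn:\nCosine similarity is ' + str(cos_similarity_sklearn))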

Sklearn:
Cosine similarity is [[ 0.99820208]]
