# [Chapter 3] Vectors and Text Similarity

In [7]:
import numpy as np
from numpy import dot
from numpy.linalg import norm

### Listing 3.1
In this example, we explore ranking two documents for the query "Apple Juice". We represent the query and each of the documents as a feature vector.

#### Text Content
*Query*: "Apple Juice"

*Document 1*: 
```Lynn: ham and cheese sandwhich, chocolate cookie, ice water.
Brian: turkey avocado sandwhich, plain potato chips, apple juice
Mohammed: grilled chicken salad, fruit cup, lemonade```

*Document 2*: ```Orchard Farms apple juice is premium, organic apple juice made from the freshest apples and never from concentrate. It has received the regional award for best apple juice three years in a row.```

#### Dense Vectors
If we consider a vector with each keyword as a feature (48 terms total):
```[a, and, apple, apples, avocado, award, best, brian, cheese, chicken, chips, chocolate, concentrate, cookie, cup, farms, for, freshest, from, fruit, grilled, ham, has, ice, in, is, it, juice, lemonade, lynn, made, mohammed, never, orchard, organic, plain, potato, premium, received, regional, row, salad, sandwhich, the, three, turkey, water, years]```


Then our query becomes a 48-feature vector in which only the `apple` and `juice` features are set:
Query:      ```[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]```




In [10]:
query_vector = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
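The vocabulary and the query vector do not have to be typed by hand. The following is a minimal sketch, assuming a simple lowercase, letters-only tokenizer; the names `tokenize`, `vocabulary`, `to_vector`, and `example_query_vector` are illustrative and not part of the original listing. It rebuilds the sorted 48-term vocabulary from the two documents and derives the query vector from it.

```python
import re
import numpy as np

doc1_text = """Lynn: ham and cheese sandwhich, chocolate cookie, ice water.
Brian: turkey avocado sandwhich, plain potato chips, apple juice
Mohammed: grilled chicken salad, fruit cup, lemonade"""

doc2_text = ("Orchard Farms apple juice is premium, organic apple juice made from "
             "the freshest apples and never from concentrate. It has received the "
             "regional award for best apple juice three years in a row.")

def tokenize(text):
    # Assumption: lowercase the text and treat each run of letters as one term.
    return re.findall(r"[a-z]+", text.lower())

# One feature per distinct term, in alphabetical order.
vocabulary = sorted(set(tokenize(doc1_text) + tokenize(doc2_text)))

def to_vector(text):
    # 1 if the vocabulary term occurs in the text, 0 otherwise.
    terms = set(tokenize(text))
    return np.array([1 if term in terms else 0 for term in vocabulary])

example_query_vector = to_vector("Apple Juice")

print(len(vocabulary))
print(np.flatnonzero(example_query_vector))  # positions of the query terms
```

The positions printed at the end are the indexes of `apple` and `juice` in the alphabetical feature list.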

The corresponding vectors for our documents are as follows:

In [11]:
doc1_vector = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0])
doc2_vector = np.array([1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1])

#### Similarity
To rank our documents, we then just need to calculate the cosine similarity between each document vector and the query vector, which will become the relevance score for each document.

In [10]:
def cos_sim(vector1,vector2):
  return dot(vector1, vector2)/(norm(vector1)*norm(vector2))
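Before applying the function to the 48-feature vectors, it can help to see how it behaves on tiny inputs. The vectors below are made-up examples, not taken from the documents above: two vectors pointing in the same direction score (approximately) 1, two vectors sharing no features score 0, and a partial overlap lands in between.

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cos_sim(vector1, vector2):
    return dot(vector1, vector2) / (norm(vector1) * norm(vector2))

# Same direction, different magnitude: cosine ignores vector length.
same_direction = cos_sim(np.array([1, 2]), np.array([2, 4]))

# No features in common: the dot product is zero.
orthogonal = cos_sim(np.array([1, 0]), np.array([0, 1]))

# One of two features shared.
partial = cos_sim(np.array([1, 1]), np.array([1, 0]))

print(same_direction, orthogonal, partial)
```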

In [9]:
doc1_score = cos_sim(query_vector, doc1_vector)
doc2_score = cos_sim(query_vector, doc2_vector)

print("Relevance Scores:\n doc1: " + str(doc1_score) + "\n doc2: " + str(doc2_score))


Relevance Scores:
 doc1: 0.282842712474619
 doc2: 0.282842712474619

Interesting... Both documents received exactly the same relevance score, even though their lengthy vectors represent very different content. The reason might not be immediately obvious, so let's simplify the calculation by focusing only on the features that matter.
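For reference, the dense score can be worked out by hand. Each document vector as defined above has 25 nonzero entries, the query vector has 2, and both query terms are set in each document vector, so the dot product is 2:

$$\cos(q, d) = \frac{q \cdot d}{\|q\|\ \|d\|} = \frac{2}{\sqrt{2}\cdot\sqrt{25}} = \frac{2}{5\sqrt{2}} \approx 0.2828$$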

#### Sparse Vectors
The key to understanding the calculation is understanding that the only features that matter are the ones shared between the query and a document. All other features (words appearing in documents that don't match the query) have zero impact on whether one document is ranked higher than another. As such, we can simplify our calculations significantly by creating sparse vectors that only include the terms present in the query.

In [11]:
sparse_query_vector = [1, 1] #[apple, juice]
sparse_doc1_vector = [1, 1]
sparse_doc2_vector = [1, 1]

doc1_score = cos_sim(sparse_query_vector, sparse_doc1_vector)
doc2_score = cos_sim(sparse_query_vector, sparse_doc2_vector)

print("Relevance Scores:\n doc1: " + str(doc1_score) + "\n doc2: " + str(doc2_score))

Relevance Scores:
 doc1: 0.9999999999999998
 doc2: 0.9999999999999998
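The sparse vectors above were written by hand. As a sketch of how they could be derived instead, the following keeps only the positions where the query vector is nonzero. The helper name `project_onto_query` is hypothetical; the document vectors are copied from the earlier cells, and the query vector is rebuilt from the positions of `apple` and `juice` in the feature list.

```python
import numpy as np

query_vector = np.zeros(48, dtype=int)
query_vector[[2, 27]] = 1  # positions of `apple` and `juice` in the feature list

doc1_vector = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0])
doc2_vector = np.array([1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1])

def project_onto_query(query, document):
    # Keep only the features that are present in the query.
    query_positions = np.flatnonzero(query)
    return document[query_positions]

sparse_query = project_onto_query(query_vector, query_vector)
sparse_doc1 = project_onto_query(query_vector, doc1_vector)
sparse_doc2 = project_onto_query(query_vector, doc2_vector)

print(sparse_query, sparse_doc1, sparse_doc2)
```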


In fact, you'll notice several very interesting things:
1. This simplified sparse vector calculation still shows both `doc1` and `doc2` returning equivalent relevance scores, since they both match all the words in the query.
2. Even though the absolute scores of the dense vector similarity (0.282842712474619) and the sparse vector similarity (0.9999999999999998) differ due to normalization, the scores are still the same relative to each other (equal to each other in this case).
3. The feature weights for the two query terms (`apple`, `juice`) are exactly the same between the query and each of the documents, resulting in a cosine score of 1.0.

The problem here, of course, is that the features in the vector only signify whether the word `apple` or `juice` exists, not how well each document actually represents either of the terms.

In [25]:
doc1_tf_vector = [1, 1] #[apple:1, juice:1]
doc2_tf_vector = [3, 4] #[apple:3, juice:4]

#query should represent the "best possible" match, so we include the "top possible score" for each term in the query vector.
#we could alternatively normalize the scores in the documents
query_vector = np.maximum.reduce([doc1_tf_vector, doc2_tf_vector]) #[3, 4]

doc1_score = cos_sim(query_vector, doc1_tf_vector)
doc2_score = cos_sim(query_vector, doc2_tf_vector)

print("Relevance Scores:\n doc1: " + str(doc1_score) + "\n doc2: " + str(doc2_score))

Relevance Scores:
 doc1: 0.9899494936611665
 doc2: 1.0


#### Term Frequency
In Section 3.1.? we learn about Term Frequency (TF). The following example demonstrates how term frequency helps with our text-based sparse vector similarity scoring.

*Document 1:* ```The interesting thing is that the person in the wrong made the right decision in the end.```

*Document 2:* ```My favorite book is the cat in the hat, which is about a crazy cat who breaks into a house and creates a crazy afternoon for two kids.```

*Document 3:* ```My neighbors let the stray cat stay in their garage, which resulted in my favorite hat that I let them borrow being ruined.```

Let's map these into their corresponding (sparse) vector representations and calculate a similarity score:


In [34]:
term_counts = {"doc1": {"the": 5, "cat": 0, "in": 2, "hat": 0},
               "doc2": {"the": 2, "cat": 2, "in": 1, "hat": 1},
               "doc3": {"the": 1, "cat": 1, "in": 1, "hat": 1}}

In [37]:
#[the, cat, in, the, hat]
doc1_tf_vector = [term_counts["doc1"]["the"], term_counts["doc1"]["cat"], term_counts["doc1"]["in"], term_counts["doc1"]["the"], term_counts["doc1"]["hat"]]
doc2_tf_vector = [term_counts["doc2"]["the"], term_counts["doc2"]["cat"], term_counts["doc2"]["in"], term_counts["doc2"]["the"], term_counts["doc2"]["hat"]]
doc3_tf_vector = [term_counts["doc3"]["the"], term_counts["doc3"]["cat"], term_counts["doc3"]["in"], term_counts["doc3"]["the"], term_counts["doc3"]["hat"]]

print ("doc1_vector: [" + ", ".join(map(str,doc1_tf_vector)) + "]")
print ("doc2_vector: [" + ", ".join(map(str,doc2_tf_vector)) + "]")
print ("doc3_vector: [" + ", ".join(map(str,doc3_tf_vector)) + "]\n")
                   
#query vector contains the max value for each term, since this yields the highest similarity score
query_vector = np.maximum.reduce([doc1_tf_vector, doc2_tf_vector, doc3_tf_vector]) # [5, 2, 2, 5, 1]

doc1_score = cos_sim(query_vector, doc1_tf_vector)
doc2_score = cos_sim(query_vector, doc2_tf_vector)
doc3_score = cos_sim(query_vector, doc3_tf_vector)

print("Relevance Scores:\n doc1: " + str(doc1_score) + "\n doc2: " + str(doc2_score)+ "\n doc3: " + str(doc3_score))

doc1_vector: [5, 0, 2, 5, 0]
doc2_vector: [2, 2, 1, 2, 1]
doc3_vector: [1, 1, 1, 1, 1]

Relevance Scores:
 doc1: 0.956689206214921
 doc2: 0.9394501508629485
 doc3: 0.8733337646093731


While we at least receive different relevance scores now for each document based upon the number of times each term matches, the ordering of the results doesn't necessarily match our intuition about which documents are the best matches.

Intuitively, we would instead expect the following ordering:
1. doc2 (is about the book _The Cat in the Hat_)
2. doc3 (matches all of the words `the`, `cat`, `in`, and `hat`)
3. doc1 (only matches the words `the` and `in`, even though it contains them many times).

The problem here, of course, is that every occurrence of any word is considered equally important, so the more times ANY term appears, the more relevant that document becomes. In this case, *doc1* gets the highest score because it contains 12 total term matches (`the` ten times, since the query contains `the` twice and *doc1* contains it five times, and `in` two times), which is more total term matches than any other document.

Your intuition is probably screaming right now, "Yeah, but nobody really cares about the words `the` and `in`. It's obvious that the words `cat` and `hat` should be given the most weight here!"

And you would be right. Let's modify our scoring calculation to fix this.


### Inverse Document Frequency (IDF)


*Document Frequency (DF)* for a term is defined as the total number of documents in the search engine that contain the term, and it serves as a good measure of how important a term is. The intuition here is that more specific or rare words (like `cat` and `hat`) tend to be more important than more common words (like `the` and `in`).

$$DF(t)=\sum_{d\ \in\ c} \left(d.contains(t)\ ?\ 1\ :\ 0\right)$$

Since we would like words which are more important to get a higher score, we take an inverse of the document frequency (IDF), typically defined through the following function:

$$IDF(t)=1 + \log\left(\frac{totalDocs}{DF(t) + 1}\right)$$
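To get a feel for these two definitions, here is a minimal sketch over a tiny, made-up corpus. The four documents, the whitespace tokenization, and the function names `document_frequency` and `inverse_document_frequency` are assumptions chosen for illustration. It counts how many documents contain each term and then applies the IDF function, so that rarer terms end up with higher values.

```python
import numpy as np

# Hypothetical mini-corpus, tokenized on whitespace.
corpus = [
    "the cat in the hat",
    "the dog in the yard",
    "the bird on the roof",
    "a hat on the shelf",
]

def document_frequency(term, docs):
    # Number of documents that contain the term at least once.
    return sum(1 for doc in docs if term in doc.split())

def inverse_document_frequency(term, docs):
    # 1 + log(totalDocs / (DF(t) + 1)), using the natural logarithm.
    return 1 + np.log(len(docs) / (document_frequency(term, docs) + 1))

for term in ["the", "hat", "cat"]:
    print(term, document_frequency(term, corpus), inverse_document_frequency(term, corpus))
```

In this toy corpus `the` appears in every document and `cat` in only one, so `cat` receives the largest IDF of the three terms.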

In our query for `the cat in the hat`, a vector of IDFs would thus look as follows:

In [38]:
#[the, cat, in, hat]
df_map = {"the": 8, "cat": 3, "in":4, "hat":2}
totalDocs = 3

def idf(term):
    return 1 + np.log(totalDocs / (df_map[term] + 1) )

#same for both queries and documents; IDF is term-dependent, not document-dependent
idf_vector = np.array([idf("the"), idf("cat"), idf("in"), idf("the"), idf("hat")])

print ("idf_vector: [" + ", ".join(map(str,idf_vector)) + "]")

idf_vector: [-0.09861228866810978, 0.7123179275482191, 0.4891743762340093, -0.09861228866810978, 1.0]


### TF-IDF
We now have the two principal components of text-based relevance ranking:
- TF (measures how well a term describes a document)
- IDF (measures how important each term is)

Most search engines, and many other data science applications, leverage a combination of these two factors as the basis for textual similarity scoring, using the following function:

$$TF\_IDF = TF * IDF^2$$

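With this formula in place, we can finally calculate a relevance score (one that weights both the number of occurrences and the usefulness of terms) for how well each of our documents matches our query. In the cells below, TF is taken as the square root of the term count, and each document's score is the sum of TF times IDF over the terms of the query: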

In [45]:
def tf(term_count):
    return np.sqrt(term_count)

In [46]:
#[the, cat, in, the, hat]
query_tfidf = [tf(1)*idf("the"), tf(1)*idf("cat"), tf(1)*idf("in"), tf(1)*idf("the"), tf(1)*(idf("hat"))]

doc1_tfidf = [
               tf(term_counts["doc1"]["the"]) * idf("the") + 
               tf(term_counts["doc1"]["cat"]) * idf("cat") +
               tf(term_counts["doc1"]["in"]) * idf("in") +
               tf(term_counts["doc1"]["the"]) * idf("the") +
               tf(term_counts["doc1"]["hat"]) * idf("hat")]

doc2_tfidf = [
               tf(term_counts["doc2"]["the"]) * idf("the") + 
               tf(term_counts["doc2"]["cat"]) * idf("cat") +
               tf(term_counts["doc2"]["in"]) * idf("in") +
               tf(term_counts["doc2"]["the"]) * idf("the") +
               tf(term_counts["doc2"]["hat"]) * idf("hat")]

doc3_tfidf = [
               tf(term_counts["doc3"]["the"]) * idf("the") + 
               tf(term_counts["doc3"]["cat"]) * idf("cat") +
               tf(term_counts["doc3"]["in"]) * idf("in") +
               tf(term_counts["doc3"]["the"]) * idf("the") +
               tf(term_counts["doc3"]["hat"]) * idf("hat")]

print("Relevance Scores:\n doc1: " + str(doc1_tfidf) + "\n doc2: " + str(doc2_tfidf) + "\n doc3: " + str(doc3_tfidf))

Relevance Scores:
 doc1: [0.2507894754780836]
 doc2: [2.2176263779920133]
 doc3: [2.0042677264460087]


### Finally!
Finally our search results make intuitive sense! `doc2` gets the highest score, since it matches the most important words the most, followed by `doc3`, which contains all the words, but not as many times, followed by `doc1`, which only contains an abundance of insignificant words.
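The per-document sums above can also be written more compactly. The sketch below reuses the same term counts, document frequencies, and `tf`/`idf` definitions from the earlier cells; the names `total_docs`, `query_terms`, `tfidf_score`, `scores`, and `ranking` are introduced here for illustration only. It sums TF times IDF over every term position in the query and then sorts the documents by score.

```python
import numpy as np

term_counts = {"doc1": {"the": 5, "cat": 0, "in": 2, "hat": 0},
               "doc2": {"the": 2, "cat": 2, "in": 1, "hat": 1},
               "doc3": {"the": 1, "cat": 1, "in": 1, "hat": 1}}
df_map = {"the": 8, "cat": 3, "in": 4, "hat": 2}
total_docs = 3
query_terms = ["the", "cat", "in", "the", "hat"]

def tf(term_count):
    return np.sqrt(term_count)

def idf(term):
    return 1 + np.log(total_docs / (df_map[term] + 1))

def tfidf_score(doc_id):
    # Sum TF * IDF over every term position in the query.
    return float(sum(tf(term_counts[doc_id][term]) * idf(term) for term in query_terms))

scores = {doc_id: tfidf_score(doc_id) for doc_id in term_counts}
ranking = sorted(scores, key=scores.get, reverse=True)

print(scores)
print(ranking)
```

Sorting by score puts `doc2` first, then `doc3`, then `doc1`, matching the ordering described above.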

This TF-IDF calculation is at the heart of many search engine relevance calculations, including the default algorithm, called BM25, used by both Apache Solr and Elasticsearch. In addition, it is possible to match on much more than just text keywords: modern search engines let you dynamically specify boosts on fields, terms, and functions, which gives you full control over the relevance scoring calculation.

We'll introduce each of these in the next workbook: [Controlling Relevance](2.ch3-controlling-relevance.ipynb)