# [ Chapter 3 - Ranking and Content-based Relevance ] 
# Vectors and Text Similarity

In [97]:
import sys

sys.path.append('..')
import numpy as np
from aips import num2str, tokenize, vec2str
from numpy import dot
from numpy.linalg import norm

### Listing 3.1
In this example, we explore ranking two documents for the query "Apple Juice". We represent both the query and the documents as feature vectors.

#### Text Content
*Query*: "Apple Juice"

*Document 1*: 
```Lynn: ham and cheese sandwich, chocolate cookie, ice water.
Brian: turkey avocado sandwich, plain potato chips, apple juice
Mohammed: grilled chicken salad, fruit cup, lemonade```

*Document 2*: ```Orchard Farms apple juice is premium, organic apple juice made from the freshest apples, never from concentrate. Its juice has received the regional award for best apple juice three years in a row.```

#### Dense Vectors
If we consider a vector with each keyword as a feature (48 terms total):
```[a, and, apple, apples, avocado, award, best, brian, cheese, chicken, chips, chocolate, concentrate, cookie, cup, farms, for, freshest, from, fruit, grilled, ham, has, ice, in, is, its, juice, lemonade, lynn, made, mohammed, never, orchard, organic, plain, potato, premium, received, regional, row, salad, sandwich, the, three, turkey, water, years]```


Then our query becomes a 48-feature vector in which only the `apple` and `juice` features are set to 1:
Query:      ```[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]```




In [98]:
query_vector = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
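The query vector above can also be derived from the vocabulary rather than typed by hand. The following is a minimal sketch under stated assumptions: `to_binary_vector` is a hypothetical helper (not part of `aips`), and it splits on whitespace only, which is enough for the two-word query.

```python
import numpy as np

# The 48-term vocabulary from the listing above, in the same order.
vocabulary = [
    "a", "and", "apple", "apples", "avocado", "award", "best", "brian",
    "cheese", "chicken", "chips", "chocolate", "concentrate", "cookie",
    "cup", "farms", "for", "freshest", "from", "fruit", "grilled", "ham",
    "has", "ice", "in", "is", "its", "juice", "lemonade", "lynn", "made",
    "mohammed", "never", "orchard", "organic", "plain", "potato", "premium",
    "received", "regional", "row", "salad", "sandwich", "the", "three",
    "turkey", "water", "years",
]

def to_binary_vector(text, vocabulary):
    """Hypothetical helper: 1 if a vocabulary term occurs in the text, else 0."""
    terms = set(text.lower().split())
    return np.array([1 if term in terms else 0 for term in vocabulary])

sketch_query_vector = to_binary_vector("Apple Juice", vocabulary)

# Positions of the non-zero features (0-based): "apple" and "juice".
print(np.flatnonzero(sketch_query_vector))
```

The two non-zero positions (2 and 27, counting from zero) line up with `apple` and `juice` in the vocabulary.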

The corresponding vectors for our documents are as follows:

In [99]:
doc1_vector = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0])
doc2_vector = np.array([1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1])

#### Similarity
To rank our documents, we then just need to calculate the cosine between each document and the query, 
which will become the relevance score for each document.

In [100]:
def cosine_similarity(vector1, vector2):
    return dot(vector1, vector2) / (norm(vector1) * norm(vector2))

In [101]:
doc1_score = cosine_similarity(query_vector, doc1_vector)
doc2_score = cosine_similarity(query_vector, doc2_vector)

print(f"""Relevance Scores:
 doc1: {num2str(doc1_score)}
 doc2: {num2str(doc2_score)}""")

Relevance Scores:
 doc1: 0.2828
 doc2: 0.2828


Interesting... Both documents received exactly the same relevance score, even though the documents contain lengthy vectors with very different content. The reason might not be immediately obvious, so let's simplify the calculation by focusing only on the features that matter.
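To see where 0.2828 comes from, it may help to redo the arithmetic by hand. For 0/1 vectors, the dot product is the number of shared features and each norm is the square root of the number of non-zero features. This sketch assumes the counts read off the vectors in the listing above (2 non-zero features in the query, 25 in each document vector).

```python
import math

shared_terms = 2    # "apple" and "juice" are set in the query and in each document vector
query_nonzero = 2   # non-zero features in the query vector
doc_nonzero = 25    # non-zero features in each document vector from the listing above

# cosine = dot / (|query| * |doc|), specialised to 0/1 vectors
score = shared_terms / (math.sqrt(query_nonzero) * math.sqrt(doc_nonzero))
print(round(score, 4))  # 0.2828
```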

#### Sparse Vectors
The key to understanding the calculation is understanding that the only features that matter are the ones shared between the query and a document. All other features (words appearing in documents that don't match the query) have zero impact on whether one document is ranked higher than another. As such, we can simplify our calculations significantly by creating sparse vectors that only include the terms present in the query.
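As a rough illustration of this simplification, the sketch below projects toy dense vectors onto just the features present in the query. The 6-feature vectors and the `cosine` helper are made up for illustration. As in the listing, both toy documents have the same number of non-matching features.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D arrays."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy dense vectors over a hypothetical 6-term vocabulary.
toy_query = np.array([0, 1, 0, 0, 1, 0])
toy_doc_a = np.array([1, 1, 0, 1, 1, 0])
toy_doc_b = np.array([0, 1, 1, 0, 1, 1])

# Keep only the features that are non-zero in the query.
query_features = np.flatnonzero(toy_query)
sparse_query = toy_query[query_features]
sparse_doc_a = toy_doc_a[query_features]
sparse_doc_b = toy_doc_b[query_features]

print("dense: ", round(cosine(toy_query, toy_doc_a), 4), round(cosine(toy_query, toy_doc_b), 4))
print("sparse:", round(cosine(sparse_query, sparse_doc_a), 4), round(cosine(sparse_query, sparse_doc_b), 4))
```

The absolute scores change after the projection, but the two documents stay tied with each other in both forms.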

### Listing 3.2

In [102]:
query_vector = [1, 1] #[apple, juice]
doc1_vector = [1, 1]
doc2_vector = [1, 1]

doc1_score = cosine_similarity(query_vector, doc1_vector)
doc2_score = cosine_similarity(query_vector, doc2_vector)

print(f"""Relevance Scores:
 doc1: {num2str(doc1_score)}
 doc2: {num2str(doc2_score)}""")

Relevance Scores:
 doc1: 1.0
 doc2: 1.0


In fact, you'll notice several very interesting things:
1. This simplified sparse vector calculation still shows both `doc1` and `doc2` returning equivalent relevance scores, since they both match all the words in the query.
2. Even though the absolute scores from the dense vector similarity (0.2828) and the sparse vector similarity (1.0) differ due to normalization, the scores are still the same relative to each other (equal to each other in this case).
3. The feature weights for the two query terms (`apple`, `juice`) are exactly the same between the query and each of the documents, resulting in a cosine score of 1.0.

The problem here, of course, is that the features in the vector only signify whether the word `apple` or `juice` exists, not how well each document actually represents either of the terms. We'll correct for this by introducing the concept of "term frequency".

#### Term Frequency
In Section 3.1.4 we learn about Term Frequency (TF). If we count up the number of times each term appears in our documents, we will get a better understanding of "how well" the document represents those terms:

In [103]:
query = "apple juice"
doc1 = "Lynn: ham and cheese sandwich, chocolate cookie, ice water.\nBrian: turkey avocado sandwich, plain potato chips, apple juice\nMohammed: grilled chicken salad, fruit cup, lemonade"
doc2 = "Orchard Farms apple juice is premium, organic apple juice made from the freshest apples and never from concentrate. Its juice has received the regional award for best apple juice three years in a row."

query_term_occurrences = [tokenize(query).count("apple"), tokenize(query).count("juice")] #[apple:1, juice:1]
doc1_term_occurrences = [tokenize(doc1).count("apple"), tokenize(doc1).count("juice")] #[apple:1, juice:1]
doc2_term_occurrences = [tokenize(doc2).count("apple"), tokenize(doc2).count("juice")] #[apple:3, juice:4]

print(f"query_term_occurrences: {query_term_occurrences}")
print(f"doc1_term_occurrences: {doc1_term_occurrences}")
print(f"doc2_term_occurrences: {doc2_term_occurrences}")


query_term_occurrences: [1, 1]
doc1_term_occurrences: [1, 1]
doc2_term_occurrences: [3, 4]


We can see that the feature values for the terms `apple` and `juice` are now weighted in each vector by the number of occurrences in each document. Unfortunately, we can't just compute a cosine similarity on the raw term counts: the query contains only one occurrence of each term, so cosine similarity favors documents whose counts are in the same proportion as the query's, whereas we would consider documents with multiple occurrences of each term to likely be more similar. Let's test it out and see the problem:

### Listing 3.3

In [104]:
doc1_tf_vector = [1, 1] #[apple:1, juice:1]
doc2_tf_vector = [3, 4] #[apple:3, juice:4]

query_vector = [1, 1] #[apple:1, juice:1]

doc1_score = cosine_similarity(query_vector, doc1_tf_vector)
doc2_score = cosine_similarity(query_vector, doc2_tf_vector)

print(f"""Relevance Scores:
 doc1: {num2str(doc1_score)}
 doc2: {num2str(doc2_score)}""")

Relevance Scores:
 doc1: 1.0
 doc2: 0.9899
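The reason `doc2` scores lower is that cosine similarity only measures the angle between vectors, not their length. A small sketch (the `cosine` helper and the `[10, 10]` vector are illustrative, not from the listing) shows that scaling a document's counts leaves its cosine score unchanged:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two sequences of numbers."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

q = [1, 1]  # [apple:1, juice:1]

print(round(cosine(q, [1, 1]), 4))    # same proportions as the query
print(round(cosine(q, [10, 10]), 4))  # ten times the counts, same proportions
print(round(cosine(q, [3, 4]), 4))    # more occurrences, slightly different proportions
```

The first two scores come out equal, while `[3, 4]` scores slightly lower (0.9899) despite containing more occurrences of both terms.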


Since our goal is for documents like `doc2` with higher term frequency to score higher, we can overcome this by switching from cosine similarity to another scoring function, such as _dot product_ or _Euclidean distance_, that increases as feature weights continue to increase. Let's switch to the dot product (`a · b`), which is equal to the cosine similarity multiplied by the length of the query vector times the length of the document vector: `a · b = |a| × |b| × cos(θ)`. The dot product scores documents higher when they contain more matching terms, whereas cosine similarity scores documents higher when their proportion of matching terms more closely resembles the query's.
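The identity `a · b = |a| × |b| × cos(θ)` can be checked numerically. This sketch computes the angle between the two 2-D vectors independently (using `atan2`, an approach chosen only for illustration) and compares the result with the direct dot product:

```python
import math
import numpy as np

a = np.array([1.0, 1.0])  # query vector [apple:1, juice:1]
b = np.array([3.0, 4.0])  # doc2 term counts [apple:3, juice:4]

# Angle between the vectors, from each vector's angle to the x-axis.
theta = math.atan2(b[1], b[0]) - math.atan2(a[1], a[0])

direct = float(np.dot(a, b))
via_angle = float(np.linalg.norm(a) * np.linalg.norm(b) * math.cos(theta))

print(direct, round(via_angle, 4))
```

Both values agree (7 for these vectors), which is why longer document vectors pull the dot product up even when the angle to the query barely changes.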

### Listing 3.4

In [105]:
doc1_tf_vector = [1, 1] #[apple:1, juice:1]
doc2_tf_vector = [3, 4] #[apple:3, juice:4]

query_vector = [1, 1] #[apple:1, juice: 1]

doc1_score = dot(query_vector, doc1_tf_vector)
doc2_score = dot(query_vector, doc2_tf_vector)

print(f"""Relevance Scores:
 doc1: {num2str(doc1_score)}
 doc2: {num2str(doc2_score)}""")

Relevance Scores:
 doc1: 2
 doc2: 7


Great - the result rankings now look more in line with our expectations (for this simple example, at least)!

As our feature-weighting calculations are getting more sophisticated, let's move beyond our initial `"apple juice"` example toward a query and documents with more interesting statistics in terms of intersections and overlaps between terms in the query and terms in the documents.

The following example demonstrates some more useful characteristics that will better help us understand how term frequency helps with our text-based sparse vector similarity scoring.

*Document 1:* ```In light of the big reveal in her interview, the interesting thing is that the person in the wrong probably made a good decision in the end.```

*Document 2:* ```My favorite book is the cat in the hat, which is about a crazy cat in a hat who breaks into a house and creates the craziest afternoon for two kids.```

*Document 3:* ```My careless neighbors apparently let a stray cat stay in their garage unsupervised, which resulted in my favorite hat that I let them borrow being ruined.```

Let's map these into their corresponding (sparse) vector representations and calculate a similarity score:


### Listing 3.5

In [106]:
doc1 = "In light of the big reveal in her interview, the interesting thing is that the person in the wrong probably made a good decision in the end."
doc2 = "My favorite book is the cat in the hat, which is about a crazy cat in a hat who breaks into a house and creates the craziest afternoon for two kids."
doc3 = "My careless neighbors apparently let a stray cat stay in their garage unsupervised, which resulted in my favorite hat that I let them borrow being ruined."
docs = [doc1, doc2, doc3]

In [107]:
def term_count(content, term):
    tokenized_content = tokenize(content)
    # Use a distinct local name so it doesn't shadow the function name
    occurrences = tokenized_content.count(term.lower())
    return float(occurrences)

In [108]:
#dot
query = "the cat in the hat"
terms = query.split(" ")
doc_vectors = [[term_count(doc, term) for term in terms] for doc in docs]
query_vector = [1 for term in terms] 
doc_scores = [dot(dv, query_vector) for dv in doc_vectors]
   
print("\nlabels: ", terms)
    
print(f"\nquery vector: [{', '.join(map(num2str,query_vector))}]\n")

for i, doc in enumerate(doc_vectors):
    print(f"doc{i+1} vector: [{', '.join(map(num2str,doc))}]")        

print("\nRelevance Scores:")
for i, score in enumerate(doc_scores):
    print(f" doc{i+1}: {num2str(score)}")


labels:  ['the', 'cat', 'in', 'the', 'hat']

query vector: [1, 1, 1, 1, 1]

doc1 vector: [5.0, 0.0, 4.0, 5.0, 0.0]
doc2 vector: [3.0, 2.0, 2.0, 3.0, 2.0]
doc3 vector: [0.0, 1.0, 2.0, 0.0, 1.0]

Relevance Scores:
 doc1: 14.0
 doc2: 12.0
 doc3: 4.0


Unfortunately, those results don't necessarily match our intuition about which documents are the best matches. Intuitively, we would instead expect the following ordering:
1. doc2 (is about the book _The Cat in the Hat_ )
2. doc3 (matches the words `cat`, `in`, and `hat`)
3. doc1 (only matches the words `the` and `in`, even though it contains them many times).

The problem here, of course, is that since every occurrence of any word is considered just as important, the more times ANY term appears, the more relevant that document becomes. In this case, *doc1* is getting the highest score because it contains 14 total term matches (`the` five times, counted twice because `the` appears twice in the query, plus `in` four times), which is more total term matches than any other document.

To overcome these issues, "term frequency" calculations will typically both normalize for document length (take the total term count divided by document length) and also dampen the effect of additional term occurrences (take the square root of term occurrences).

This gives us the following term frequency calculations:


### Listing 3.6

In [109]:
def tf(content, term):
    tokenized_content = tokenize(content)
    term_count = tokenized_content.count(term.lower())
    vector_length = len(tokenized_content)
    return float(np.sqrt(term_count)) / float(vector_length)

With our updated TF calculation in place, let's calculate our relevance ranking again:

In [110]:
query = "the cat in the hat"
terms = query.split(" ")
doc_vectors = [[tf(doc, term) for term in terms] for doc in docs]
query_vector = [1 for term in terms] 
doc_scores = [dot(dv, query_vector) for dv in doc_vectors]
    
print("Document TF Vector Values:")
for i, doc in enumerate(doc_vectors):
    print(f" doc{i + 1}: [" + ', '.join(map(lambda t : f'tf(doc{i + 1}, "{t}")', terms))+ "]")
print("\nLabels:", terms)
for i, doc in enumerate(doc_vectors):
    print(f"  doc{i+1}: [{', '.join(map(num2str,doc))}]")
print("\nRelevance Scores:")
for i, score in enumerate(doc_scores):
    print(f" doc{i+1}: {num2str(score)}")

Document TF Vector Values:
 doc1: [tf(doc1, "the"), tf(doc1, "cat"), tf(doc1, "in"), tf(doc1, "the"), tf(doc1, "hat")]
 doc2: [tf(doc2, "the"), tf(doc2, "cat"), tf(doc2, "in"), tf(doc2, "the"), tf(doc2, "hat")]
 doc3: [tf(doc3, "the"), tf(doc3, "cat"), tf(doc3, "in"), tf(doc3, "the"), tf(doc3, "hat")]

Labels: ['the', 'cat', 'in', 'the', 'hat']
  doc1: [0.0828, 0.0, 0.0741, 0.0828, 0.0]
  doc2: [0.0559, 0.0456, 0.0456, 0.0559, 0.0456]
  doc3: [0.0, 0.0385, 0.0544, 0.0, 0.0385]

Relevance Scores:
 doc1: 0.2397
 doc2: 0.2486
 doc3: 0.1313


The normalized TF clearly helped, as `doc2` is now ranked the highest, as we would expect. This is mostly because of the dampening effect on the number of term occurrences: in `doc1`, which matched `the` and `in` so many times, each additional occurrence contributed less to the feature weight than prior occurrences. Unfortunately, `doc1` is still ranked second highest, so even that wasn't enough to get the better-matching `doc3` ranked above it.
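The dampening effect can be made concrete with a short sketch. It looks only at the square-root part of the `tf` function (ignoring the division by document length) and prints how much extra weight each additional occurrence adds:

```python
import numpy as np

occurrences = np.arange(1, 6)                 # 1 through 5 occurrences of a term
dampened = np.sqrt(occurrences)               # square-root dampening, as in tf()
marginal_gain = np.diff(dampened, prepend=0)  # extra weight added by each occurrence

for n, weight, gain in zip(occurrences, dampened, marginal_gain):
    print(f"{n} occurrence(s): weight={weight:.4f}, gain from last occurrence={gain:.4f}")
```

Each successive occurrence adds less than the one before it, so five occurrences of `the` are worth a little more than twice a single occurrence rather than five times as much.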

Your intuition is probably screaming right now: "Yeah, but nobody really cares about the words `the` and `in`. It's obvious that the words `cat` and `hat` should be given the most weight here!"

And you would be right. Let's modify our scoring calculation to fix this.


### Inverse Document Frequency (IDF)


*Document Frequency (DF)* for a term is defined as the total number of documents in the search engine that contain the term, and it serves as a good measure of how important a term is. The intuition here is that more specific or rare words (like `cat` and `hat`) tend to be more important than more common words (like `the` and `in`).

$$DF(t)=\sum_{d\ \in\ c} \left(\ d.contains(t)\ ?\ 1\ :\ 0\ \right)$$

Since we would like words which are more important to get a higher score, we take an inverse of the document frequency (IDF), typically defined through the following function:

$$IDF(t)=1 + \log \left(\ totalDocs\ /\ (\ DF(t)\ +\ 1\ )\ \right)$$

In our query for `the cat in the hat`, a vector of IDFs would thus look as follows:

### Listing 3.7

In [111]:
def idf(term):
    #Mocked document counts from an inverted index
    df_map = {"the": 9500, "cat": 100, "in":9000, "hat":50}
    totalDocs = 10000
    return 1 + np.log(totalDocs / (df_map[term] + 1) )

terms = ["the", "cat", "in", "the", "hat"]
idf_vector = [idf(term) for term in terms]

print("IDF Vector Values:")
print("  [" + ', '.join(map(lambda t: f'idf("{t}")', terms)) + "]\n")
#print("Labels: ", terms)
print(f"IDF Vector:\n  {vec2str(idf_vector)}")

IDF Vector Values:
  [idf("the"), idf("cat"), idf("in"), idf("the"), idf("hat")]

IDF Vector:
  [1.0512, 5.5952, 1.1052, 1.0512, 6.2785]
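To get a feel for how this IDF function behaves, the sketch below evaluates it over a range of document frequencies. `idf_sketch` is a hypothetical stand-in that takes the document frequency directly instead of looking it up, and the collection size matches the mocked value in the listing above.

```python
import math

total_docs = 10000  # same mocked collection size as in the listing above

def idf_sketch(doc_freq, total_docs=total_docs):
    """IDF as defined above: 1 + log(totalDocs / (DF + 1)), using the natural log."""
    return 1 + math.log(total_docs / (doc_freq + 1))

for doc_freq in [1, 50, 100, 1000, 9000, 9500]:
    print(f"DF={doc_freq:>5}: IDF={idf_sketch(doc_freq):.4f}")
```

Rare terms (low DF) receive the largest weights, and the weight approaches 1 as a term appears in nearly every document.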


### TF-IDF
We now have the two principal components of text-based relevance ranking:
- TF (measures how well a term describes a document)
- IDF (measures how important each term is)

Most search engines, and many other data science applications, leverage a combination of each of these factors as the basis for textual similarity scoring, using the following function:

$$TF\_IDF = TF * IDF^2$$

With this formula in place, we can finally calculate a relevance score (that weights both number of occurrences and usefulness of terms) for how well each of our documents matches our query:

### Listing 3.8

In [112]:
def tf_idf(doc, term):
    return tf(doc, term) * idf(term)**2

In [113]:
query = "the cat in the hat"
terms = query.split(" ")
doc_vectors = [[tf_idf(doc, term) for term in terms] for doc in docs]
query_vector = [1 for term in terms]
doc_scores = [[dot(query_vector, dv)] for dv in doc_vectors]

print("Document TF-IDF Vector Calculations")
for i, doc in enumerate(doc_vectors):
    print(f" doc{i + 1}: [" + ', '.join(map(lambda t:
          f'tf_idf(doc{i + 1}, "{t}")',terms)) + "]")

print("\nDocument TF-IDF Vector Scores")
print ("Labels:", terms)
for i, doc in enumerate(doc_vectors):
    print(f"  doc{i + 1}: [{', '.join(map(num2str, doc))}]")

print("\nRelevance Scores:")
for i, score in enumerate(doc_scores):
    print(f" doc{i + 1}: {', '.join(map(num2str, score))}")

Document TF-IDF Vector Calculations
 doc1: [tf_idf(doc1, "the"), tf_idf(doc1, "cat"), tf_idf(doc1, "in"), tf_idf(doc1, "the"), tf_idf(doc1, "hat")]
 doc2: [tf_idf(doc2, "the"), tf_idf(doc2, "cat"), tf_idf(doc2, "in"), tf_idf(doc2, "the"), tf_idf(doc2, "hat")]
 doc3: [tf_idf(doc3, "the"), tf_idf(doc3, "cat"), tf_idf(doc3, "in"), tf_idf(doc3, "the"), tf_idf(doc3, "hat")]

Document TF-IDF Vector Scores
Labels: ['the', 'cat', 'in', 'the', 'hat']
  doc1: [0.0915, 0.0, 0.0905, 0.0915, 0.0]
  doc2: [0.0617, 1.4282, 0.0557, 0.0617, 1.7983]
  doc3: [0.0, 1.2041, 0.0664, 0.0, 1.5161]

Relevance Scores:
 doc1: 0.2735
 doc2: 3.4057
 doc3: 2.7867


### Finally!
Finally, our search results make intuitive sense! `doc2` gets the highest score, since it matches the most important words the most, followed by `doc3`, which contains both important words (`cat` and `hat`), but not as many times, followed by `doc1`, which only contains an abundance of insignificant words.
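To see where a single entry in those vectors comes from, here is the arithmetic for `tf_idf(doc2, "cat")` worked out step by step. The inputs are assumptions read off the earlier listings: `cat` occurs twice in `doc2`, `doc2` has 31 tokens (consistent with the TF value 0.0456), and the mocked document frequency for `cat` is 100 out of 10,000 documents.

```python
import math

term_count = 2      # occurrences of "cat" in doc2
doc_length = 31     # tokens in doc2 (assumed; consistent with sqrt(2)/31 ≈ 0.0456)
doc_freq = 100      # mocked document frequency for "cat"
total_docs = 10000  # mocked collection size

tf_value = math.sqrt(term_count) / doc_length          # dampened, length-normalized TF
idf_value = 1 + math.log(total_docs / (doc_freq + 1))  # IDF as defined above
tf_idf_value = tf_value * idf_value ** 2               # TF * IDF^2

print(round(tf_value, 4), round(idf_value, 4), round(tf_idf_value, 4))
```

The three values should match the 0.0456, 5.5952, and 1.4282 shown for `cat` in the `doc2` rows above.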

This TF-IDF calculation is at the heart of many search engine relevance calculations, including the default algorithm, called BM25, used by both Apache Solr and Elasticsearch. In addition, it is possible to match on much more than just text keywords: modern search engines enable dynamically specifying boosts of fields, terms, and functions, which enables full control over the relevance scoring calculation.

We'll introduce each of these in the next workbook: [Controlling Relevance](2.controlling-relevance.ipynb)