## A simple tf-idf implementation:

In [128]:
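import math
def tfidf(term, doc, doc_list):
  # Compare whole words, not substrings ("he" should not match "the")
  docs_with_term = [d for d in doc_list if term in d.split()]
  tf = doc.split().count(term) #term count
  df = len(docs_with_term)/len(doc_list) #document frequency
  if df == 0:
    return 0.0 #term appears in no document, so avoid dividing by zero
  idf = math.log(1/df)
  return tf * idf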

In [129]:
documents = ["the mouse", "the small cat", "the cheese", "the big cat"]

In [130]:
# TODO: calculate the importance of each word in a document
print("the importance is", tfidf("mouse", "the mouse", documents))
print("the importance is", tfidf("small", "the small cat", documents))
print("the importance is", tfidf("cheese", "the cheese", documents))
print("the importance is", tfidf("cat", "the big cat", documents))

the importance is 1.3862943611198906
the importance is 1.3862943611198906
the importance is 1.3862943611198906
the importance is 0.6931471805599453
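
The printed values can be checked by hand. With four documents, a word found in only one of them has `df = 1/4` and `idf = log(4)`, while "cat" appears in two, so `df = 2/4` and `idf = log(2)`. Each word occurs once in its own document, so `tf = 1`. A minimal check of that arithmetic, using only `math`:

```python
import math

n_docs = 4

# "mouse", "small", "cheese": found in one document each, tf = 1
rare_word_score = 1 * math.log(1 / (1 / n_docs))

# "cat": found in two of the four documents, tf = 1
cat_score = 1 * math.log(1 / (2 / n_docs))

print(rare_word_score)  # log(4)
print(cat_score)        # log(2)
```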


In [131]:
def getUniqueVocab(documents):
  #Find unique vocab words
  vocab = []
  for doc in documents:
    vocab.extend(doc.split())
  vocab = set(vocab)
  return vocab

def vectorize(phrase, documents):
  #Find unique vocab words
  vocab = getUniqueVocab(documents)

  #Build vector representation
  result = []
  for word in vocab:
    result.append(tfidf(word, phrase, documents))
  return result

In [132]:
# TODO: let's vectorize some documents!
print(getUniqueVocab(documents))
print(vectorize("the small cat", documents))

{'big', 'mouse', 'cat', 'cheese', 'the', 'small'}
[0.0, 0.0, 0.6931471805599453, 0.0, 0.0, 1.3862943611198906]
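
The order of the values above follows the iteration order of a Python `set`, which is not guaranteed to be the same from run to run. One possible variation, sketched here with a hypothetical helper `vectorize_sorted` (not part of the original notebook), sorts the vocabulary so that each position always maps to the same word. The small `tfidf` inside is a word-based stand-in so that the block runs on its own.

```python
import math

documents = ["the mouse", "the small cat", "the cheese", "the big cat"]

def tfidf(term, doc, doc_list):
    # Word-level tf-idf, same formula as the simple implementation above
    df = sum(term in d.split() for d in doc_list) / len(doc_list)
    return doc.split().count(term) * math.log(1 / df) if df else 0.0

def vectorize_sorted(phrase, documents):
    # Hypothetical variant of vectorize: a sorted vocab gives a stable order
    vocab = sorted({word for doc in documents for word in doc.split()})
    return vocab, [tfidf(word, phrase, documents) for word in vocab]

vocab, vector = vectorize_sorted("the small cat", documents)
for word, score in zip(vocab, vector):
    print(f"{word:>6}: {score:.4f}")
```

Returning the vocabulary next to the vector makes it easy to see which word each score belongs to.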


----
## TF-IDF with SK-Learn

This is a pretty crude approach to tf-idf. In practice we often want to smooth the tf-idf computation and include extra normalization terms.

There are many corner cases to consider, and variations on how we can compute the `tf` term and the `idf` term.

It's often useful to use a library to compute tf-idf, and scikit-learn provides one: `TfidfVectorizer`.

In [133]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [134]:
# TODO: create vectorizer and fit_transform our documents (training set)
vectorizer = TfidfVectorizer(min_df=1)
vectorized_docs = vectorizer.fit_transform(documents)

Note that "the" (word #5) has a non-zero value even though it occurs in every document. "The" still gets the smallest value (around 0.4), and "cheese" and "mouse" still get the largest (0.88).

For which examples does SK-Learn's TF-IDF formula give "the" the most weight? Why is that? Any advantages to this method?

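The actual scikit-learn formula (with default settings) is `tfidf(t, d) = tf(t, d) * idf(t)` with `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the total number of documents in the collection, `df(t)` is the number of documents that contain term `t`, and `tf(t, d)` is the number of times `t` occurs in document `d`. Each document vector is then scaled to unit Euclidean (L2) length. The `+ 1` added to the idf is why "the" keeps a non-zero weight.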
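
As a sanity check, the sketch below recomputes one vector by hand, assuming `TfidfVectorizer` runs with its default settings (`smooth_idf=True`, `norm='l2'`), where idf is `ln((1 + n) / (1 + df)) + 1` and each vector is scaled to unit length. The helper name `sklearn_style_tfidf` is made up for this example, and `str.split` is a simplification of the library's tokenizer.

```python
import math

documents = ["the mouse", "the small cat", "the cheese", "the big cat"]

def sklearn_style_tfidf(phrase, documents):
    # Hypothetical helper: mimics TfidfVectorizer defaults for simple input
    vocab = sorted({w for doc in documents for w in doc.split()})
    n = len(documents)
    weights = []
    for word in vocab:
        df = sum(word in doc.split() for doc in documents)
        idf = math.log((1 + n) / (1 + df)) + 1   # smoothed idf
        weights.append(phrase.split().count(word) * idf)
    norm = math.sqrt(sum(w * w for w in weights))
    # L2-normalize; an all-zero vector is left unchanged
    return vocab, [w / norm if norm else 0.0 for w in weights]

vocab, vector = sklearn_style_tfidf("the big cheese", documents)
for word, weight in zip(vocab, vector):
    print(f"{word:>6}: {weight:.8f}")
```

The weights for "big", "cheese" and "the" should agree with what `vectorizer.transform(["the big cheese"])` returns for the same phrase.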

We can use `vectorizer.transform` along with `toarray()` to get the vectorized result of a new document:

In [135]:
# TODO: print the vector version of "the big cheese"
print(vectorizer.transform(["the big cheese"]).toarray())

[[0.66338461 0.         0.66338461 0.         0.         0.34618161]]


----
## Your turn:

1. First, can you compute the distance between two vectors? My examples give the squared Euclidean distance, but you can use other distance measures.

Here is a stub of `my_dist` to get you started:

In [136]:
import numpy as np
def my_dist(v1, v2):
  return np.sum( (v1-v2)**2 )

In [137]:
print(my_dist(np.array([3]),np.array([4])))          #Squared L2 Dist is 1
print(my_dist(np.array([1,1]),np.array([2,2])))      #Squared L2 Dist is 2
print(my_dist(np.array([1,2,3]),np.array([1,-1,7]))) #Squared L2 Dist is 25

1
2
25
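
Squared Euclidean distance is only one choice. As an example of another distance term, here is a small sketch of cosine distance (1 minus the cosine similarity). The function name `cosine_dist` is an assumption for illustration, and it is written in plain Python so that it accepts lists or 1-D arrays.

```python
import math

def cosine_dist(v1, v2):
    # 1 - cos(angle between v1 and v2); close to 0 means same direction
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 1.0  # convention chosen here: a zero vector matches nothing
    return 1.0 - dot / (norm1 * norm2)

print(cosine_dist([1, 1], [2, 2]))   # same direction
print(cosine_dist([1, 0], [0, 1]))   # orthogonal
```

For unit-length vectors, squared Euclidean distance equals twice the cosine distance, so with `TfidfVectorizer`'s default L2 normalization the two measures rank documents the same way for any non-zero query vector.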


If your distance function above works, you can now take the distance between every training vector and a new document using code like this:

In [138]:
new_post_vec = vectorizer.transform(["the big cheese"])
[my_dist(new_post_vec.toarray(),train_vec.toarray()) for train_vec in vectorized_docs] #I get [1.7, 1.7, 0.5, 0.7], your results may differ

[np.float64(1.6796869262748297),
 np.float64(1.7374616297112078),
 np.float64(0.5034428121791692),
 np.float64(0.7733760581012943)]

2. Now write a function called `findClosest` which takes a string as input and returns the document in `documents` whose feature vector is closest.

Again, we'll give you a stub of `findClosest` to help get you started:

In [139]:
def findClosest(prompt):
  # TODO: Fix this to be the actual closest index, as in the nearest neighbors algorithm
  closest_id = 0
  min_dist = float('inf')
  for i, vec in enumerate(vectorized_docs):
    # NOTE: this compares against the global new_post_vec ("the big cheese"), not prompt
    dist = my_dist(new_post_vec.toarray(), vec.toarray())
    if dist < min_dist:
      min_dist = dist
      closest_id = i
  return documents[closest_id]

In [140]:
findClosest("the big cheese") #Should probably return "the cheese"

'the cheese'

---
Thought Experiment - Submit your answers as your activity!
 - What would you want to happen when the input is `"a cheesy slice of pizza"`?
 - What does happen?
 - How might we fix this?

In [141]:
findClosest("a cheesy slice of pizza")  #TODO: Should this still return "the cheese"?

'the cheese'

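Yes, it does. As written, `findClosest` never uses its argument: it always measures distances from `new_post_vec`, the vector for "the big cheese", so every input returns "the cheese".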
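
One possible answer to "How might we fix this?", sketched under assumptions: vectorize the prompt inside the function, and return `None` when the prompt shares no words with the training documents, since an all-zero vector is not meaningfully close to anything. The names `find_closest`, `bag_of_words` and `squared_dist` are hypothetical, and a plain word-count vectorizer stands in for `TfidfVectorizer` so that the block runs on its own.

```python
documents = ["the mouse", "the small cat", "the cheese", "the big cat"]
vocab = sorted({word for doc in documents for word in doc.split()})

def bag_of_words(text):
    # Stand-in vectorizer: raw word counts over the training vocabulary
    words = text.split()
    return [words.count(word) for word in vocab]

def squared_dist(v1, v2):
    return sum((a - b) ** 2 for a, b in zip(v1, v2))

def find_closest(prompt, documents, vectorize=bag_of_words):
    prompt_vec = vectorize(prompt)   # built from the prompt itself
    if not any(prompt_vec):
        return None                  # no known words: no sensible match
    return min(documents, key=lambda d: squared_dist(prompt_vec, vectorize(d)))

print(find_closest("the big cheese", documents))
print(find_closest("a cheesy slice of pizza", documents))
```

With the notebook's fitted vectorizer, something like `vectorize=lambda s: vectorizer.transform([s]).toarray()[0]` could be passed instead. Matching "cheesy" to "cheese" would need an extra step such as stemming, which this sketch does not attempt.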