### Calculate the TF-IDF Matrix from scratch

The following code is my implementation of the TF-IDF matrix. I compare its output with scikit-learn's `TfidfVectorizer` to confirm that the calculation is correct.

**Note: If you came to this notebook from [PracticeLSA2](https://github.com/zaynpatel/LTA2061/blob/main/PracticeLSA2.ipynb), refer to this notebook for the TF-IDF matrix calculation *instead of* the one in my original notebook. In addition, term frequency is defined here as the raw count of a term in a document divided by the total number of terms in that document, which gives a *relative frequency*. The general intuition behind TF-IDF and the IDF definition in [PracticeLSA2](https://github.com/zaynpatel/LTA2061/blob/main/PracticeLSA2.ipynb) are still correct, but the calculation of the matrix there is not.**
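
Written out, the definitions used in this notebook look like this. The notation is my own: $f_{t,d}$ is the raw count of term $t$ in document $d$, $n$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents that contain $t$. The IDF shown is the smoothed form that the code in this notebook computes.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t)
$$

For example, with the two documents used here, `banana` appears in one document, so $\mathrm{idf} = \ln(3/2) + 1 \approx 1.405$, while `apple` and `kiwi` appear in both, so $\mathrm{idf} = \ln(3/3) + 1 = 1$. Each row of the resulting matrix is then scaled to unit L2 length.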

**Also note that I arranged my words alphabetically because this is how `TfidfVectorizer` orders its output columns. To use my Python version on other documents, I would need to sort the words before running the tf and idf steps.**
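
As a minimal sketch of that sorting step, the vocabulary could be derived from the documents instead of being typed by hand. The helper `build_vocabulary` is hypothetical (it is not part of either notebook) and assumes lowercase, whitespace-separated tokens.

```python
# Hypothetical helper (not in the original notebook): collect the unique
# whitespace-separated tokens and return them in alphabetical order.
def build_vocabulary(documents):
    return sorted({word for document in documents for word in document.split()})

documents = ['apple apple banana kiwi', 'apple kiwi kiwi kiwi']
words = build_vocabulary(documents)
print(words)  # ['apple', 'banana', 'kiwi']
```

On other inputs the columns may still differ from scikit-learn's, because `TfidfVectorizer` lowercases text and drops single-character tokens by default, while `str.split()` does neither.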

In [None]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ['apple apple banana kiwi', 'apple kiwi kiwi kiwi']
words = ['apple', 'banana', 'kiwi']
dtm = np.zeros((len(documents), len(words)))

word_lookup = { v : k for k, v in enumerate(words)}

for d_idx, document in enumerate(documents):
    for word in document.split():
        word_idx = word_lookup[word]
        dtm[d_idx][word_idx] += 1
        
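# Term frequency: divide each row by its total count (relative frequency)
tf_matrix = dtm / dtm.sum(axis=1, keepdims=True)
print(tf_matrix)
# Document frequency: number of documents in which each word appears
col_sums = np.count_nonzero(dtm, axis=0)
# Smoothed IDF, matching smooth_idf=True: ln((1 + n) / (1 + df)) + 1
idf_matrix = np.log((1 + len(documents)) / (1 + col_sums)) + 1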

Tfidf = tf_matrix * idf_matrix
# This L2 normalization of each row is critical to matching sklearn's output (norm='l2')
norms = np.linalg.norm(Tfidf, axis = 1, keepdims=True)
Tfidf = Tfidf / norms
print(Tfidf)
# Test with sklearn vectorizer
vectorizer = TfidfVectorizer(norm='l2', use_idf=True, smooth_idf=True)
tf_idf_matrix = vectorizer.fit_transform(documents)
print(tf_idf_matrix.toarray())

[[0.5  0.25 0.25]
 [0.25 0.   0.75]]
[[0.75726441 0.53215436 0.37863221]
 [0.31622777 0.         0.9486833 ]]
[[0.75726441 0.53215436 0.37863221]
 [0.31622777 0.         0.9486833 ]]
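
One way to see why the normalization step matters: scikit-learn's vectorizer uses raw counts for tf by default, while the code above uses relative frequencies. The two differ only by a constant factor per row, and L2 normalization removes that factor. Below is a minimal sketch of that check with the counts hard-coded; the helper `l2_normalize` and the variable names are my own assumptions for illustration.

```python
import numpy as np

# Raw counts for the two example documents; columns are apple, banana, kiwi
dtm = np.array([[2.0, 1.0, 1.0],
                [1.0, 0.0, 3.0]])

def l2_normalize(matrix):
    """Scale each row to unit Euclidean length."""
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

# Smoothed IDF, same formula as in the cell above
n_docs = dtm.shape[0]
doc_freq = np.count_nonzero(dtm, axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Relative-frequency tf versus raw-count tf
relative_tf = dtm / dtm.sum(axis=1, keepdims=True)
from_relative = l2_normalize(relative_tf * idf)
from_raw = l2_normalize(dtm * idf)

print(np.allclose(from_relative, from_raw))  # True
```

Without the normalization, `relative_tf * idf` and `dtm * idf` would differ by each document's length, so the unnormalized matrices would not agree.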
