# Computing Similarities Across Large Document Datasets

excerpt from __Data Science Bookcamp: Five Python Projects__ MEAP V04 livebook by Leonard Apeltsin

<div class="alert alert-block alert-info">
Large parts of the explanatory text in this notebook were taken from the liveProject notebook.
</div>

In [8]:
# Load the principal NumPy array
import numpy as np
from numpy import load
tfidf_np_matrix = load('../data/df_Words.npz')
sample = tfidf_np_matrix['arr_0']
print(sample[0])
# print(sample.size) # -> it's in the ballpark of 11,314 (nbr of posts) * 114,751 (nbr of unique words), but isn't exact

[0. 0. 0. ... 0. 0. 0.]


In [6]:
# Computing similarities to a single newsgroup post
cosine_similarities = tfidf_np_matrix['arr_0'] @ tfidf_np_matrix['arr_0'][0]
print(cosine_similarities)

[1.         0.00834093 0.04448717 ... 0.         0.00270615 0.01968562]


The matrix-vector product took a few seconds (ed.: more than a few) to complete and produced a vector of cosine similarities. Index `i` of the vector holds the cosine similarity between `newsgroups.data[0]` and `newsgroups.data[i]`. The printout shows that `cosine_similarities[0]` equals 1.0. This is *not surprising*, since `newsgroups.data[0]` has perfect similarity with itself. The next-highest similarity sits at index `np.argsort(cosine_similarities)[-2]`. The call to `argsort` returns the array indices sorted by ascending value, so the second-to-last index (the `[-2]`) corresponds to the post with the second-highest similarity.

NB. This assumes that no other post has a perfect similarity of 1. An alternative call is `np.argmax(cosine_similarities[1:]) + 1`, but it only works when the query post sits at index 0.
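To make the indexing concrete, here is a minimal sketch on a made-up five-element similarity vector. The values and the name `toy_similarities` are illustrative assumptions, not data from the newsgroups matrix.

```python
import numpy as np

# Made-up similarity vector; index 0 plays the role of the query post itself.
toy_similarities = np.array([1.0, 0.10, 0.64, 0.0, 0.32])

# argsort returns the indices ordered by ascending value.
sorted_indices = np.argsort(toy_similarities)
print(sorted_indices)   # [3 1 4 2 0]

# The last index is the query itself; the second-to-last is its nearest neighbor.
second_best = sorted_indices[-2]
print(second_best)      # 2

# Alternative that skips the query, valid only when the query sits at index 0.
alternative = np.argmax(toy_similarities[1:]) + 1
print(alternative)      # 2
```

Both approaches agree here because the only perfect score belongs to the query at index 0.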

In [10]:
# Deps
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(remove=('headers', 'footers'))

# Finding the most similar newsgroup post
most_similar_index = np.argsort(cosine_similarities)[-2]
similarity = cosine_similarities[most_similar_index]
most_similar_post = newsgroups.data[most_similar_index]
print(f'---\n\nThe following post has a cosine similarity of {similarity:.2f} '
       'with newsgroups.data[0]:\n')
print(most_similar_post)

---

The following post has a cosine similarity of 0.64 with newsgroups.data[0]:

In article <1993Apr20.174246.14375@wam.umd.edu> lerxst@wam.umd.edu (where's my  
thing) writes:
> 
>  I was wondering if anyone out there could enlighten me on this car I saw
> the other day. It was a 2-door sports car, looked to be from the late 60s/
> early 70s. It was called a Bricklin. The doors were really small. In  
addition,
> the front bumper was separate from the rest of the body. This is 
> all I know. If anyone can tellme a model name, engine specs, years
> of production, where this car is made, history, or whatever info you
> have on this funky looking car, please e-mail.

Bricklins were manufactured in the 70s with engines from Ford. They are rather  
odd looking with the encased front bumper. There aren't a lot of them around,  
but Hemmings (Motor News) ususally has ten or so listed. Basically, they are a  
performance Ford with new styling slapped on top.

>    ---- brought to you by your

So we see that the reply contains the text of the original post (and we learn about the Bricklin in a world before web search). Due to this textual overlap, their cosine similarity is 0.64. Although this number does not sound large, within an extensive text collection a cosine similarity greater than 0.6 is a good indicator of overlapping content.

NB. The cosine similarity can easily be converted into the Tanimoto similarity by running `cosine_similarities / (2 - cosine_similarities)`. However, the conversion does not change the final result: it preserves the ranking, so the top index of the Tanimoto array points to the same posted reply.
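The ranking claim can be checked on a toy vector. This is a small sketch with made-up cosine values (the names `toy_cosines` and `toy_tanimoto` are assumptions for illustration); it relies on the fact that `c / (2 - c)` is strictly increasing for `c` between 0 and 1.

```python
import numpy as np

# Made-up cosine similarities between unit-length vectors.
toy_cosines = np.array([1.0, 0.10, 0.64, 0.0, 0.32])

# For unit-length vectors, Tanimoto = cos / (2 - cos).
toy_tanimoto = toy_cosines / (2 - toy_cosines)

# The conversion is strictly increasing, so sorting either array
# yields the same index order.
same_ranking = np.array_equal(np.argsort(toy_cosines), np.argsort(toy_tanimoto))
print(toy_tanimoto.round(2))
print(same_ranking)   # True
```

The individual values shrink (0.64 becomes roughly 0.47), but the order of the posts does not change.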

### Realities: Why you should not rush to compute the matrix of all-by-all cosine similarities

The TFIDF matrix has over 100k columns, so computing all-by-all similarities on it directly is not computationally efficient. We will first reduce the column count using Scikit-Learn's `TruncatedSVD` class.
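A rough back-of-the-envelope estimate shows the scale of the problem. The sketch below assumes dense `float64` storage and uses the row and column counts reported in this notebook (11,314 posts, 114,441 columns); a sparse matrix would cost less, so treat the figures as an upper-bound illustration. The helper names are hypothetical.

```python
# Rough cost of an all-by-all product with dense float64 rows.
# Counts are the ones reported in this notebook; helpers are illustrative only.
n_posts = 11_314
n_columns_full = 114_441
n_columns_reduced = 100
bytes_per_float = 8

def dense_gigabytes(rows, cols):
    """Memory needed to hold a dense float64 matrix, in gigabytes."""
    return rows * cols * bytes_per_float / 1e9

def multiply_adds(rows, cols):
    """Multiply-add operations in a dense (rows x cols) @ (cols x rows) product."""
    return rows * rows * cols

full_gb = dense_gigabytes(n_posts, n_columns_full)
reduced_gb = dense_gigabytes(n_posts, n_columns_reduced)
speedup = multiply_adds(n_posts, n_columns_full) / multiply_adds(n_posts, n_columns_reduced)

print(f'Full matrix:    {full_gb:.1f} GB')
print(f'Reduced matrix: {reduced_gb:.3f} GB')
print(f'Reduction in multiply-adds: about {speedup:.0f}x')
```

Under these assumptions the dense full-width matrix needs about 10.4 GB, while a 100-column version fits in roughly 9 MB and needs over a thousand times fewer multiply-adds for the all-by-all product.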

Scikit-Learn's documentation occasionally suggests useful parameter values for common applications of an algorithm. The number of output columns is set with the `n_components` parameter, and the suggested value for processing text data is 100. [RTFM](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html)

In [14]:
# Deps
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(newsgroups.data)

# Dimensionally reducing `tfidf_matrix` using SVD
np.random.seed(0)
from sklearn.decomposition import TruncatedSVD

shrunk_matrix = TruncatedSVD(n_components=100).fit_transform(tfidf_matrix)
print(f'---\n\nWe\'ve dimensionally-reduced a {tfidf_matrix.shape[1]}-column '
      f'{type(tfidf_matrix)} matrix.\n')
print(f'---\n\nOur output is a {shrunk_matrix.shape[1]}-column '
      f'{type(shrunk_matrix)} matrix.')

---

We've dimensionally-reduced a 114441-column <class 'scipy.sparse.csr.csr_matrix'> matrix.
---

Our output is a 100-column <class 'numpy.ndarray'> matrix.


Now that the matrix is shrunk, we can compute cosine similarities by running `shrunk_matrix @ shrunk_matrix.T`, but first we will confirm that the matrix rows remain normalized.
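The check matters because a matrix product only yields cosine similarities when every row has unit length. A minimal sketch on a made-up 3x2 matrix (the name `toy_matrix` and its values are assumptions for illustration):

```python
import numpy as np
from numpy.linalg import norm

# Made-up matrix whose rows are not unit length.
toy_matrix = np.array([[3.0, 4.0],
                       [4.0, 3.0],
                       [1.0, 0.0]])

# Raw dot products are not cosines, because row magnitudes leak into the result.
raw_products = toy_matrix @ toy_matrix.T

# Divide each row by its own magnitude, then multiply again.
unit_rows = toy_matrix / norm(toy_matrix, axis=1, keepdims=True)
cosines = unit_rows @ unit_rows.T

print(raw_products[0, 1])                    # 24.0
print(round(cosines[0, 1], 2))               # 0.96
print(np.allclose(np.diag(cosines), 1.0))    # True
```

After the rows are rescaled, every diagonal entry is 1 (each row is perfectly similar to itself) and the off-diagonal entries fall in the expected cosine range.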

In [18]:
# Deps
from numpy.linalg import norm

# Checking the magnitude of `shrunk_matrix[0]`
magnitude = norm(shrunk_matrix[0])
print(f'---\n\nThe magnitude of the first row is {magnitude:.2f}')

---

The magnitude of the first row is 0.49


💥 The magnitude is less than 1, so the SVD output has not been automatically normalized. We will normalize `shrunk_matrix` manually with Scikit-Learn's built-in `normalize` function.

In [19]:
# Normalizing the SVD output
from sklearn.preprocessing import normalize
shrunk_norm_matrix = normalize(shrunk_matrix)
magnitude = norm(shrunk_norm_matrix[0])
print(f'---\n\nThe magnitude of the first row is {magnitude:.2f}')

---

The magnitude of the first row is 1.00


In [20]:
# Now calculate the matrix of all-by-all cosine similarities
cosine_similarity_matrix = shrunk_norm_matrix @ shrunk_norm_matrix.T

In [21]:
# Use the cosine similarity matrix (built from the shrunken data)
# to pick a random post and find its most similar post
np.random.seed(1) # nail down that seed
index1 = np.random.randint(len(newsgroups.data))
index2 = np.argsort(cosine_similarity_matrix[index1])[-2]
similarity = cosine_similarity_matrix[index1][index2]
print(f'---\n\nThe posts at indices {index1} and {index2} share a cosine '
      f'similarity of {similarity:.2f}')

---

The posts at indices 235 and 7805 share a cosine similarity of 0.91


In [22]:
# Printing the post most similar to the randomly chosen one (index2)
print(newsgroups.data[index2].replace('\n\n', '\n'))

Hello,
	Who can tell me   Where can I find the PD or ShareWare   
Which can CAPTURE windows 3.1's output of printer mananger?
	I want to capture the output of HP Laser Jet III.
	Though the PostScript can setup to print to file,but HP can't.
	I try DOS's redirect program,but they can't work in Windows 3.1
		Thankx for any help....
--
 Internet Address: u7911093@cc.nctu.edu.tw
    English Name: Erik Wang
    Chinese Name: Wang Jyh-Shyang


In [23]:
# Printing the randomly chosen post (index1); my assumption is that it is an "answer" quoting the post above
print(newsgroups.data[index1].replace('\n\n', '\n'))

u7911093@cc.nctu.edu.tw ("By SWH ) writes:
>Who can tell me which program (PD or ShareWare) can redirect windows 3.1's
>output of printer manager to file? 
>	I want to capture HP Laser Jet III's print output.
> 	Though PostScript can setup print to file,but HP can't.
>	I use DOS's redirect program,but they can't work in windows.
>		Thankx for any help...
>--
> Internet Address: u7911093@cc.nctu.edu.tw
>    English Name: Erik Wang
>    Chinese Name: Wang Jyh-Shyang
> National Chiao-Tung University,Taiwan,R.O.C.
Try setting up another HPIII printer but when choosing what port to connect it
to choose FILE instead of like :LPT1.  This will prompt you for a file name
everytime you print with that "HPIII on FILE" printer. Good Luck.

