# Using Document Representations for Similarity Search

### ISM6564

---



# Introduction

In this notebook, we will prepare the data for text mining using techniques such as tokenization, stop word removal, and stemming. We will also represent our list of documents (in this case, a list of strings) as both a count vector matrix and a Term Frequency-Inverse Document Frequency (TF-IDF) matrix. We will use the [scikit-learn](https://scikit-learn.org/stable/) library to perform these tasks.
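
As a quick illustration of the preprocessing steps named above, the following minimal sketch uses the analyzer that `CountVectorizer` builds internally to show tokenization, lowercasing, and stop word removal on a single sentence. It assumes scikit-learn's built-in English stop word list; stemming and lemmatization are not performed by scikit-learn and would need another library such as spaCy.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the callable that CountVectorizer applies to each
# document: preprocessing (lowercasing), tokenization, and stop word removal
analyzer = CountVectorizer(stop_words="english", lowercase=True).build_analyzer()

# example sentence taken from the corpus used later in this notebook
tokens = analyzer("Her peppered steak tasted of pepper.")
print(tokens)
```

The stop words "her" and "of" are dropped, and the remaining tokens are lowercased.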

In [26]:
import numpy as np
import pandas as pd
import re

# we will use spacy for lemmatization (it's much better than nltk)
import spacy

# we will use sklearn for feature extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# we will use sklearn for dimensionality reduction
from sklearn.decomposition import TruncatedSVD

np.random.seed(42)


Let's start with a corpus.

In [27]:
# Define the corpus of documents
corpus = [ # corpus is a list of documents
    "Dog bites man", # document 1
    "Canine nips at male.", # document 2, etc.
    "A Bus transports people.",
    "When all drove in the car.",
    "Her peppered steak tasted of pepper.",
    "She peppered the costume with flecks of glitter.",
]

The vocabulary is the set of unique words in the corpus.

## Create a document-term matrix (DTM)

### Binary BOW

NOTE: I break this section into a number of smaller steps to illustrate details about what is happening with CountVectorizer. In later sections, I'll remove any steps used for explanation, and simplify this process down into one cell.

In [28]:
# fit a CountVectorizer to the data, with binary=True
binary_bow_vectorizer = CountVectorizer(
    stop_words='english', 
    lowercase=True, 
    binary=True
    )

binary_bow_vectorizer.fit(corpus)

In [29]:
# display the vocabulary of the corpus
print(f"{len(binary_bow_vectorizer.vocabulary_)} unique words in the vocabulary")
binary_bow_vectorizer.vocabulary_

18 unique words in the vocabulary


{'dog': 5,
 'bites': 0,
 'man': 10,
 'canine': 2,
 'nips': 11,
 'male': 9,
 'bus': 1,
 'transports': 17,
 'people': 12,
 'drove': 6,
 'car': 3,
 'peppered': 14,
 'steak': 15,
 'tasted': 16,
 'pepper': 13,
 'costume': 4,
 'flecks': 7,
 'glitter': 8}

In [30]:
# display the document to term binary matrix
binary_bow = binary_bow_vectorizer.transform(corpus)
binary_bow.toarray()

array([[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
       [0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]])

In [31]:
# put this vocabulary and the binary matrix together into one dataframe
binary_bow_df = pd.DataFrame(binary_bow.toarray(), columns=binary_bow_vectorizer.get_feature_names_out())
binary_bow_df

,bites,bus,canine,car,costume,dog,drove,flecks,glitter,male,man,nips,people,pepper,peppered,steak,tasted,transports
0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
3,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0
5,0,0,0,0,1,0,0,1,1,0,0,0,0,0,1,0,0,0


If we want to turn this into a term-document matrix (TDM), we can simply transpose the DTM. NOTE: The TDM is not the orientation that scikit-learn's vectorizers produce (they put documents in rows), but it is useful for some applications.

In [32]:
binary_bow_df.transpose()

,0,1,2,3,4,5
bites,1,0,0,0,0,0
bus,0,0,1,0,0,0
canine,0,1,0,0,0,0
car,0,0,0,1,0,0
costume,0,0,0,0,0,1
dog,1,0,0,0,0,0
drove,0,0,0,1,0,0
flecks,0,0,0,0,0,1
glitter,0,0,0,0,0,1
male,0,1,0,0,0,0


### Raw Frequency BOW 

In [33]:
frequency_vectorizer = CountVectorizer(
    stop_words='english',
    lowercase=True,
    binary=False,
    # a custom preprocessor replaces the built-in one (including lowercasing),
    # so we lowercase here as well as stripping digits
    preprocessor=lambda x: re.sub(r'\d+', '', x.lower())
    )

frequency = frequency_vectorizer.fit_transform(corpus)

frequency_df = pd.DataFrame(frequency.todense(), columns=frequency_vectorizer.get_feature_names_out())

frequency_df

,bites,bus,canine,car,costume,dog,drove,flecks,glitter,male,man,nips,people,pepper,peppered,steak,tasted,transports
0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
3,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0
5,0,0,0,0,1,0,0,1,1,0,0,0,0,0,1,0,0,0


### Term Frequency-Inverse Document Frequency (TF-IDF) BOW

In [34]:
# create a TF-IDF vectorizer

tfidf_vectorizer = TfidfVectorizer(
    stop_words='english', 
    lowercase=True, 
    binary=False, 
    )

tfidf = tfidf_vectorizer.fit_transform(corpus)

tfidf_df = pd.DataFrame(tfidf.todense(), columns=tfidf_vectorizer.get_feature_names_out())

tfidf_df

,bites,bus,canine,car,costume,dog,drove,flecks,glitter,male,man,nips,people,pepper,peppered,steak,tasted,transports
0,0.57735,0.0,0.0,0.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.57735
3,0.0,0.0,0.0,0.707107,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.521823,0.427903,0.521823,0.521823,0.0
5,0.0,0.0,0.0,0.0,0.521823,0.0,0.0,0.521823,0.521823,0.0,0.0,0.0,0.0,0.0,0.427903,0.0,0.0,0.0
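
To make the weights in this table less of a black box, here is a minimal sketch that rebuilds them by hand. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency of a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "Dog bites man",
    "Canine nips at male.",
    "A Bus transports people.",
    "When all drove in the car.",
    "Her peppered steak tasted of pepper.",
    "She peppered the costume with flecks of glitter.",
]

# raw term counts (tf), shape (n_documents, n_terms)
counts = CountVectorizer(stop_words="english").fit_transform(corpus).toarray()
n_documents = counts.shape[0]

# document frequency: number of documents that contain each term
document_frequency = (counts > 0).sum(axis=0)

# smoothed idf, assuming scikit-learn's default smooth_idf=True
idf = np.log((1 + n_documents) / (1 + document_frequency)) + 1

# weight the counts by idf, then scale each row to unit (L2) length
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# compare against the library result
sklearn_tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

This also explains why "peppered" gets a lower weight (0.427903) than "pepper" (0.521823) in document 4: "peppered" appears in two documents, so its idf is smaller.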


## Calculate Similarities

In [40]:
from sklearn.metrics.pairwise import cosine_similarity

query = ["This morning, I peppered my dog with pepper spray"]

# vectorize the query once, reusing the vocabulary and IDF weights fit on the corpus
query_tfidf = tfidf_vectorizer.transform(query)
print(query_tfidf.todense())

# cosine similarity between the query and each document (one row of tfidf per document)
similarities = []
for i, row in enumerate(tfidf):
    similarity = cosine_similarity(query_tfidf, row)[0][0]
    print(f"{similarity:10.8f}, {corpus[i]:s}")
    similarities.append(float(similarity))

print(similarities)
# print the document that has the highest cosine similarity to the query
most_similar = corpus[similarities.index(max(similarities))]
print(most_similar)
print(f"'{query[0]:s}' is most similar to '{most_similar:s}'")


[[0.         0.         0.         0.         0.         0.61171251
  0.         0.         0.         0.         0.         0.
  0.         0.61171251 0.50161301 0.         0.         0.        ]]
0.35317238, Dog bites man
0.00000000, Canine nips at male.
0.00000000, A Bus transports people.
0.00000000, When all drove in the car.
0.53384753, Her peppered steak tasted of pepper.
0.21464157, She peppered the costume with flecks of glitter.
[0.3531723822361618, 0.0, 0.0, 0.0, 0.5338475288922488, 0.21464157332686065]
Her peppered steak tasted of pepper.
'This morning, I peppered my dog with pepper spray' is most similar to 'Her peppered steak tasted of pepper.'
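
The row-by-row loop can also be collapsed into a single call. The sketch below is one possible way to do that, not the only one: it passes the whole document matrix to `cosine_similarity` and picks the best match with `np.argmax`. It also checks that, because `TfidfVectorizer` L2-normalizes each row by default, the cosine similarity equals a plain dot product. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Dog bites man",
    "Canine nips at male.",
    "A Bus transports people.",
    "When all drove in the car.",
    "Her peppered steak tasted of pepper.",
    "She peppered the costume with flecks of glitter.",
]
query = ["This morning, I peppered my dog with pepper spray"]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(corpus)   # shape (6, n_terms)
query_vector = vectorizer.transform(query)       # shape (1, n_terms)

# one call compares the query against every document at once
scores = cosine_similarity(query_vector, doc_vectors).ravel()

# rows are unit length (norm='l2' is the default), so a dot product gives the same scores
dot_scores = (query_vector @ doc_vectors.T).toarray().ravel()
print(np.allclose(scores, dot_scores))

best = int(np.argmax(scores))
print(f"{scores[best]:.8f}, {corpus[best]}")
```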


## Discussion

The techniques we have learned so far for representing documents as vectors all try to capture what makes a document distinctive. The binary BOW representation records whether a term is present or not. The raw frequency BOW representation records how many times a term appears in a document. The TF-IDF representation records how many times a term appears in a document, and also discounts terms that appear in many documents across the corpus.

Though we could have used any of the three representations (binary, raw frequency, or TF-IDF) to calculate similarities, we chose TF-IDF. Why? Because it carries the most information about what distinguishes one document from another. The binary representation carries the least, because it only records whether a term is present. The raw frequency representation carries more, because it records how often a term appears in a document. The TF-IDF representation carries the most, because it combines that within-document frequency with how rare the term is across the corpus.
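
To see how much the choice of representation matters here, the following sketch scores the same query against the corpus under all three representations. It is a self-contained illustration under the same vectorizer settings used above (English stop words, lowercasing); the dictionary keys and variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Dog bites man",
    "Canine nips at male.",
    "A Bus transports people.",
    "When all drove in the car.",
    "Her peppered steak tasted of pepper.",
    "She peppered the costume with flecks of glitter.",
]
query = ["This morning, I peppered my dog with pepper spray"]

vectorizers = {
    "binary": CountVectorizer(stop_words="english", binary=True),
    "raw frequency": CountVectorizer(stop_words="english"),
    "tf-idf": TfidfVectorizer(stop_words="english"),
}

results = {}
for name, vectorizer in vectorizers.items():
    doc_vectors = vectorizer.fit_transform(corpus)
    query_vector = vectorizer.transform(query)
    # similarity of the query to every document under this representation
    results[name] = cosine_similarity(query_vector, doc_vectors).ravel()
    print(f"{name:>13}: {results[name].round(3)}")
```

In this tiny corpus no term occurs more than once in any document or in the query, so the binary and raw frequency scores coincide. Only the TF-IDF weighting changes the scores.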

**When looking at the results, what potential problem are we seeing with TF-IDF?**

