# Text Featurization and Search

In this homework, you will featurize about 10,000 documents using different indexing techniques covered in the class: (i) term frequency, (ii) term document incidence matrix, and (iii) term frequency-inverse document frequency (tf-idf). You will then use these features to search through the document corpus for a given textual query.

Recall that featurization means representing raw input data as vectors that can then be passed to machine learning or search models.
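As a minimal illustration of what featurization means (a toy sketch, not part of the assignment code), the snippet below maps two made-up sentences to count vectors over a shared vocabulary. The sentences and the helper name `count_vector` are hypothetical.

```python
# Toy sketch: represent short texts as count vectors over a shared vocabulary.
# The documents and the helper name `count_vector` are made up for illustration.
docs = ["neural network training", "image data and image labels"]

# Build a sorted vocabulary from whitespace-separated tokens
vocab = sorted({tok for doc in docs for tok in doc.split()})

def count_vector(text, vocab):
    """Return how many times each vocabulary word occurs in `text`."""
    tokens = text.split()
    return [tokens.count(word) for word in vocab]

vectors = [count_vector(doc, vocab) for doc in docs]
for doc, vec in zip(docs, vectors):
    print(doc, "->", vec)
```

Each position in a vector corresponds to one vocabulary word, which is the same layout the later parts use with `vocab_words`.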

## Download and read the data

First, you will download and read the data from a pickle file. Our data comprises 10,000 sentences taken from patents pertaining to Artificial Intelligence.

Patents often contain a lot of boilerplate text that adds little value when trying to automatically understand what a specific text conveys. Such text often consists of words like "claims", "method", and "system", which occur in almost all patent documents but say little about what a given patent contains.

You will see how tf-idf helps represent sentences in patent documents by down-weighting boilerplate text, leaving more weight for informative terms.

In [1]:
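# Install dependencies
# Note: scikit-learn is published on PyPI as `scikit-learn`; `sklearn` is a deprecated placeholder name
!pip install nltk numpy sklearn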

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sklearn
  Downloading sklearn-0.0.post1.tar.gz (3.6 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sklearn
  Building wheel for sklearn (setup.py) ... done
  Created wheel for sklearn: filename=sklearn-0.0.post1-py3-none-any.whl size=2344 sha256=352395bf9e6bb83864cb87aa11dc2a372491ab29f788dd1699e5830c780e871d
  Stored in directory: /root/.cache/pip/wheels/14/25/f7/1cc0956978ae479e75140219088deb7a36f60459df242b1a72
Successfully built sklearn
Installing collected packages: sklearn
Successfully installed sklearn-0.0.post1


In [2]:
# Import dependencies; use !pip install <lib_name> to install, if needed
import pickle as pkl
import nltk
nltk.download('punkt') 
from nltk.corpus import stopwords
nltk.download('stopwords')
import string
from nltk.text import TextCollection
import numpy as np
from numpy import dot
from numpy.linalg import norm
from sklearn.metrics import ndcg_score

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
# Download the file from Dropbox and save it locally
!wget https://www.dropbox.com/s/6knv8c1iz22nivt/patent_sentences.pkl?dl=1 -O patent_sentences.pkl

--2023-01-24 23:07:20--  https://www.dropbox.com/s/6knv8c1iz22nivt/patent_sentences.pkl?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.3.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.3.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/dl/6knv8c1iz22nivt/patent_sentences.pkl [following]
--2023-01-24 23:07:20--  https://www.dropbox.com/s/dl/6knv8c1iz22nivt/patent_sentences.pkl
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uca60acd68042023121128e45840.dl.dropboxusercontent.com/cd/0/get/B1Pr9buUNTuuo3U_xvplbZi2d2RlEuEa8LZJmRIaUrhfWK6wxUpGDVX7RKwiObFigvrZWMXQZ8El1zrpwicI8LuLEjVfD9bqHyPq-sXYviIFPA93izbabp_88RUorKTbQJJLjHyfut6b9Cw1CkCr8iHREm0CnUvw0o0UURzVa5WIYzb2HbnGK6Al2vAGsaOdpOM/file?dl=1# [following]
--2023-01-24 23:07:21--  https://uca60acd68042023121128e45840.dl.dropboxusercontent.com/cd/0/get/B1Pr9buUNTuuo3U_xvplbZi2d2RlEuEa8

# Part 0: Load the data and normalize using lowercasing (0.1 points)

In [4]:
# Load the sentences
with open('patent_sentences.pkl', 'rb') as f:
    patent_sentences = pkl.load(f)

# YOUR CODE GOES HERE
# You are required to lowercase the text in `patent_sentences`
for i in range(len(patent_sentences)):
  patent_sentences[i] = patent_sentences[i].lower()

print("Number of sentences: ", len(patent_sentences))
print("\nSample sentences: ")
for sent in patent_sentences[:5]:
    print(sent)
    print("-"*10)

Number of sentences:  10000

Sample sentences: 
summary  this summary section is provided to introduce aspects of embodiments in a simplified form, with further explanation of the embodiments following in the detailed description.
----------
responsive to receiving a request to modify the existing recipe to meet the set of desired colors, the set of desired colors has at least one target color that is different from a set of existing colors of the final food dish, the illustrative embodiment identifies at least one of the set of existing colors to be changed to meet the desired set of colors.
----------
the method comprises the steps of receiving an image frame including a plurality of pixels, each of the pixel including a first pixel information; performing a multi-background generation module based on the plurality of pixels; generating a plurality of background pixels based on the multi-background generation module; performing a moving object detection module; and deriving the backg

# Part 1: Print the most frequent words in the entire corpus (1 point)

In this part, you will count the frequencies of unique words in the entire corpus and print the most frequent ones. First, you will convert the entire corpus (currently a list of strings) into a single string. You will then use NLTK's `word_tokenize` function to split the string into tokens, and remove stop words (using NLTK's stopwords list) and punctuation from the token list. Finally, you will use NLTK's `FreqDist()` on the remaining tokens to obtain the word frequencies. The code for reverse sorting based on frequencies is provided.


In [5]:
entire_corpus = ' '.join(patent_sentences)

# YOUR CODE GOES HERE

# Step 1. Tokenize `entire_corpus`
entire_corpus_tokens = nltk.tokenize.word_tokenize(entire_corpus)

# Step 2. Remove stop words
stop = set(stopwords.words("english"))
filtered_word_tokens = [word.lower() for word in entire_corpus_tokens if word.lower() not in stop]

# Step 3. Remove punctuations
punc =  set(string.punctuation) 
filtered_word_tokens_nopunc = [word for word in filtered_word_tokens if word not in punc]

# Step 4: Get the frequency of various terms in the corpus
from nltk.probability import FreqDist
fdist = FreqDist()
for word in filtered_word_tokens_nopunc:
  fdist[word.lower()] += 1

token_freqs = list(fdist.items())

# Step 5: Ensure correct formatting
# Note: The final output of this step should be a variable named `token_freqs`, which should be a list of tuples in the format: (word, frequency)

# Reverse sort `token_freqs` based on the frequency of words
token_freqs.sort(key=lambda x: x[-1], reverse = True)
# Print the top 10 words and their frequency in the corpus
print("Top 10 words in the entire corpus based on their frequency: ")
for var in range(10):
    print("word: ", token_freqs[var][0], "  |   count: ", token_freqs[var][1])

Top 10 words in the entire corpus based on their frequency: 
word:  may   |   count:  2202
word:  data   |   count:  2091
word:  one   |   count:  2050
word:  system   |   count:  1385
word:  method   |   count:  1210
word:  first   |   count:  1017
word:  based   |   count:  956
word:  image   |   count:  955
word:  information   |   count:  946
word:  user   |   count:  935


In [6]:
# SANITY CHECK; DO NOT MODIFY THIS PART
# If your code above is correct, you should see that just the top 7 words in the entire vocabulary occur over 10,000 times in the entire corpus.
count = 0
for var in range(7):
    count += token_freqs[var][1]
print("The number of times top-7 words occur in the entire corpus", count)

The number of times top-7 words occur in the entire corpus 10911


# Question 1.1: Qualitatively analyzing the most frequent words (0.2 points)

Based on your understanding of Artificial Intelligence, which of the top-10 words are related to AI, and which could be boilerplate words used when writing patents? Note that words that are not specific to AI could also be used when writing patents in other fields, like biotechnology. For instance, patents are filed for *first* things in a field, and hence the word "first" could be a highly frequent word that has nothing to do with Artificial Intelligence as such.

# Answer 1.1: 
Your answer goes here.  
List the words that are AI-related: data, system, method, image, information, user

List the words that are boilerplate patent words: may, one, first, based

# Part 2: Computing inverse document frequencies (idf) (0.6 points)
In this part, you are going to compute the inverse document frequency of each unique term in the entire training corpus. 

You will start with the reverse-sorted `token_freqs` computed in Part 1. You will then use NLTK's `idf` computation to find the words with the highest inverse document frequencies. Again, the idf scores will be stored in a list of tuples of the format: (word, idf_score). The output variable should be named `token_idf_scores`. The relevant documentation can be viewed here: https://www.nltk.org/api/nltk.text.html?highlight=tf_idf#nltk.text.TextCollection.idf 
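For intuition, NLTK's `TextCollection.idf` (to our understanding) takes the natural log of the number of texts divided by the number of texts containing the term, and returns 0 for a term that never occurs. The sketch below reproduces that formula in plain Python; the function name `idf_score` and the document counts are illustrative assumptions, not values from the assignment.

```python
import math

def idf_score(num_docs, docs_with_term):
    """Natural-log inverse document frequency; 0.0 if the term never occurs."""
    if docs_with_term == 0:
        return 0.0
    return math.log(num_docs / docs_with_term)

# With 10,000 documents, a term found in a single document gets the largest score,
# while a term found in every document gets 0.
rare = idf_score(10_000, 1)
common = idf_score(10_000, 5_000)      # hypothetical document count
everywhere = idf_score(10_000, 10_000)
print(rare, common, everywhere)
```

Under this formula, rarer terms always score higher, so sorting by IDF surfaces words that appear in very few sentences.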

In [7]:
corpus = TextCollection(patent_sentences)

# Compute IDF scores
token_idf_scores = [] # This is the list of tuples of the format (word, idf_score)
for var in range(len(token_freqs)):
    # YOUR CODE GOES HERE
    word_idf = corpus.idf(token_freqs[var][0])
    token_idf_scores.append((token_freqs[var][0], word_idf))
    

In [8]:
# Reverse sort `token_idf_scores` based on the IDF scores of words
token_idf_scores.sort(key=lambda x: x[-1], reverse = True)
print("Top 10 words in the entire corpus based on their IDF scores: ")
for var in range(10):
    print("word: ", token_idf_scores[var][0], "  |   IDF score: ", token_idf_scores[var][1])

Top 10 words in the entire corpus based on their IDF scores: 
word:  subsample   |   IDF score:  9.210340371976184
word:  dumb   |   IDF score:  9.210340371976184
word:  enrollment   |   IDF score:  9.210340371976184
word:  fueling   |   IDF score:  9.210340371976184
word:  podcast   |   IDF score:  9.210340371976184
word:  allelic   |   IDF score:  9.210340371976184
word:  tour-related   |   IDF score:  9.210340371976184
word:  reinforced   |   IDF score:  9.210340371976184
word:  origination-destination   |   IDF score:  9.210340371976184
word:  depot   |   IDF score:  9.210340371976184


In [9]:
# SANITY CHECK; DO NOT MODIFY THIS CODE BLOCK
# If your code above is correct, you should see that the top 7 words based on idf scores occur only about 35 times in the entire corpus.
# This number is far smaller than the number of times the top 7 words based on tf scores occur in the entire corpus (around 10,000).
count = 0
for var in range(7):
    curr_val = [x[1] for x in token_freqs if token_idf_scores[var][0] == x[0]][0]
    count += curr_val
print("The number of times top-7 words occur in the entire corpus", count)

The number of times top-7 words occur in the entire corpus 36


# Question 2.1: Qualitative analysis based on words with highest IDF scores (0.3 points)

(A) Based on your understanding of Artificial Intelligence, which of these top-10 words by IDF score are related to AI, and which could be boilerplate words used when writing patents? For instance, words like "podcast" may not be considered generic boilerplate text in patent documents, as they are unlikely to be encountered in patents that have nothing to do with podcasts. (0.1 points)

(B) On average, if you were to encounter a word from the list obtained using global frequencies versus the list obtained using IDF scores, which would be more informative for determining what the patent sentence is about? (0.2 points)

# Answer 2.1: 
(A) Your answer goes here.  
List the words that are AI-related: subsample,  reinforced 


List the words that are boilerplate patent words: enrollment

(B) Your answer goes here.  

IDF scores would be more informative towards determining what the patent sentence is about.



# Question 2.2: What's the value of IDF scores over global frequencies? (0.3 points)

So far you have looked at unique words in a corpus being associated with two scores: global frequencies and IDF scores. Articulate why IDF scores could be better measures to understand the importance of a term in a corpus than its global frequencies. You may use specific examples from your outputs above. 

# Answer 2.2: 
Your answer goes here.

IDF scores are more likely to be informative for determining what a sentence is about because they emphasize how unique a term is. A higher IDF score indicates that the word is rarer across documents, whereas common words that carry less meaning, like "the" or "and," receive lower scores.




# Part 3: Computing term frequencies (tf) (1.4 points)
In this part, you will represent each sentence in the corpus using term frequencies. You will be using NLTK's `tf` computation. Read the documentation here: https://www.nltk.org/api/nltk.text.html#nltk.text.TextCollection.tf. Note that `tf` takes tokenized text as input.
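As a rough sketch of what the `tf` computation does, the term frequency of a word is its count in the tokenized text divided by the total number of tokens. The token list below is written by hand and assumes the first sample sentence after stop-word and punctuation removal; the helper name `term_frequency` is hypothetical.

```python
# Hand-written tokens of the first sample sentence after removing stop words
# and punctuation (an assumption about what the filtering step produces).
tokens = ["summary", "summary", "section", "provided", "introduce", "aspects",
          "embodiments", "simplified", "form", "explanation", "embodiments",
          "following", "detailed", "description"]

def term_frequency(term, tokens):
    """Relative frequency of `term` within one tokenized text."""
    return tokens.count(term) / len(tokens)

tf_scores = {word: term_frequency(word, tokens) for word in set(tokens)}
print(tf_scores["summary"])      # 2 occurrences out of 14 tokens
print(sum(tf_scores.values()))   # close to 1, up to floating-point rounding
```

Because every token contributes exactly once to some word's count, the scores over the unique words of a sentence add up to 1, which is what the sanity check in this part relies on.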

In [10]:
# Tokenize and vectorize the data using term frequencies
from nltk.text import TextCollection
corpus = TextCollection(patent_sentences)

#Construct global vocabulary here:
vocab_words = []
for sentence in patent_sentences:
    # YOUR CODE GOES HERE
    # STEP 1: Iterate over the tokenized words in the current sentence (remember to remove stop words and punctuations)
  
    ## Tokenize sentence
    s = nltk.tokenize.word_tokenize(sentence)

    ## Remove stop words
    stop = set(stopwords.words("english"))
    s_no_stop = [word.lower() for word in s if word.lower() not in stop]

    ## Remove punctuations
    punc =  set(string.punctuation) 
    s_no_punc = [word for word in s_no_stop if word not in punc]
    
    # STEP 2: If the current word in the sentence does not exist in `vocab_words`, append the current word to `vocab_words`
    for word in s_no_punc:
      if word not in vocab_words:
        vocab_words.append(word)
    
print("Number of unique words in the vocabulary: ", len(vocab_words))

all_tf_representations = []

for var in range(len(patent_sentences)):
# for var in range(10): 
# Note: Doing this computation over all the sentences might take some time. Therefore, you may consider writing your code and debugging by iterating over 10 examples.
# However, before submitting the code, make sure you run the snippet for all the sentences to ensure complete correctness. 
    current_tf_representation = [0.0 for x in vocab_words] # initialization for storing the tf representation of the current sentence in this list
    # YOUR CODE GOES HERE
    # STEP 1: Obtain a list of unique words in the current sentence (with index as var) from `patent_sentences`
    # Note: Store the unique words in a variable called `current_sentence_words_unique`

    ## Tokenize sentence
    s = nltk.tokenize.word_tokenize(patent_sentences[var])
    # print(s)

    ## Remove stop words & punctuation
    stop = set(stopwords.words("english"))
    punc = set(string.punctuation)
    stop_punc = stop.union(punc)
    s_no_stop = [word.lower() for word in s if word.lower() not in stop_punc]

    current_sentence_words_unique = list(set(s_no_stop))
  
    for word in current_sentence_words_unique:
        # YOUR CODE GOES HERE
        word_tf = corpus.tf(word, s_no_stop)
        # print(word_tf)
        current_tf_representation[vocab_words.index(word)] = word_tf
    
    all_tf_representations.append(current_tf_representation)

print("Size of featurized corpus: ", len(all_tf_representations), " x ", len(all_tf_representations[0]))


Number of unique words in the vocabulary:  14908
Size of featurized corpus:  10000  x  14908
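A note on speed: `vocab_words.index(word)` scans the list on every call, which is part of why the loops above take a while. One possible speed-up, sketched below under the assumption that the vocabulary order must stay unchanged, is to build a word-to-index dictionary once and reuse it. The names `vocab_words_demo` and `word_to_index` are hypothetical, and the toy vocabulary stands in for the real one.

```python
# Toy vocabulary standing in for `vocab_words` (the real list has 14,908 entries)
vocab_words_demo = ["summary", "section", "provided", "introduce"]

# Build the lookup once; each later lookup is a dictionary access instead of a list scan
word_to_index = {word: idx for idx, word in enumerate(vocab_words_demo)}

vector = [0.0] * len(vocab_words_demo)
for word, value in [("provided", 0.5), ("summary", 0.25)]:
    vector[word_to_index[word]] = value

print(vector)
```

The dictionary returns the same positions as `list.index`, so the resulting vectors are laid out identically.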


In [11]:
# You will be implementing a sanity check to see if your tf featurization is correct
# Note, if your implementation is correct, the sum of any tf vector should be 1
print("The first patent sentence is: ", patent_sentences[0])
print("The tf vectorization of the first patent sentence is: \n", all_tf_representations[0])
print("The words and their tf scores in the sentence are provided below:")
sum_tf_for_one_sentence = 0

# YOUR CODE GOES HERE
# STEP 1: Obtain a list of unique words in the first sentence in `patent_sentences`
# Note: Store the unique words in a variable called `current_sentence_words`

first_sentence = patent_sentences[0]
first_sentence_tokenized = nltk.tokenize.word_tokenize(patent_sentences[0])

stop = set(stopwords.words("english"))
punc = set(string.punctuation)
stop_punc = stop.union(punc)
s_no_stop = [word.lower() for word in first_sentence_tokenized if word.lower() not in stop_punc]

current_sentence_words= list(set(s_no_stop))

# print(current_sentence_words)

for var in range(len(current_sentence_words)):
    print(current_sentence_words[var], all_tf_representations[0][vocab_words.index(current_sentence_words[var])])
    sum_tf_for_one_sentence += all_tf_representations[0][vocab_words.index(current_sentence_words[var])]
print("The sum of term frequencies for this sentence is: ", sum_tf_for_one_sentence)

## Note: the sum should come out to be one. If it's not 1, you may have to revisit how you are tokenizing, considering stopwords and punctuations while computing term frequencies for a given text

The first patent sentence is:  summary  this summary section is provided to introduce aspects of embodiments in a simplified form, with further explanation of the embodiments following in the detailed description.
The tf vectorization of the first patent sentence is: 
 [0.14285714285714285, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.

# Part 4: Computing tf-idf scores (0.6 points)
You will modify the code you wrote in Part 3 above to add the idf component for each word while representing the sentence. This should only involve multiplying each word's tf value by its IDF score in the code you wrote earlier.
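To see why this product down-weights boilerplate, here is a small self-contained sketch on a made-up corpus of three tokenized sentences. The helper names `tf`, `idf`, and `tf_idf` are hypothetical plain-Python stand-ins (count over length, and natural log of documents over matching documents), not the NLTK methods themselves.

```python
import math

# Three made-up tokenized "documents"; "method" plays the role of boilerplate
docs = [["method", "image", "pixel"],
        ["method", "vehicle", "sensor"],
        ["method", "image", "camera"]]

def tf(term, tokens):
    """Count of `term` divided by the number of tokens in the document."""
    return tokens.count(term) / len(tokens)

def idf(term, docs):
    """Natural log of (number of documents / documents containing `term`)."""
    matches = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / matches) if matches else 0.0

def tf_idf(term, tokens, docs):
    return tf(term, tokens) * idf(term, docs)

for term in docs[0]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

The word that appears in every document ends up with a weight of zero, while the word unique to the first document gets the largest weight, even though all three words have the same term frequency.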

In [12]:
all_tf_idf_representations = []

for var in range(len(patent_sentences)): 
# for var in range(10): 
# Note: Doing this computation over all the sentences might take some time. Therefore, you may consider writing your code and debugging by iterating over 10 examples.
# However, before submitting the code, make sure you run the snippet for all the sentences to ensure complete correctness. 
    current_tf_idf_representation = [0.0 for x in vocab_words] # initialization for storing the tf-idf representation of the current sentence in this list
    # YOUR CODE GOES HERE
    # STEP 1: Obtain a list of unique words in the current sentence (with index as var) from `patent_sentences`
    # Note: Store the unique words in a variable called `current_sentence_words_unique`

    ## Tokenize sentence
    s = nltk.tokenize.word_tokenize(patent_sentences[var])

    ## Remove stop words & punctuation
    stop = set(stopwords.words("english"))
    punc = set(string.punctuation)
    stop_punc = stop.union(punc)
    s_no_stop = [word.lower() for word in s if word.lower() not in stop_punc]

    current_sentence_words_unique = list(set(s_no_stop))
    
    for word in current_sentence_words_unique:
        # YOUR CODE GOES HERE
        word_idf = corpus.idf(word)
        word_tf_idf = word_idf * corpus.tf(word, s_no_stop)
        current_tf_idf_representation[vocab_words.index(word)] = word_tf_idf
        
    all_tf_idf_representations.append(current_tf_idf_representation)
print("Size of featurized corpus: ", len(all_tf_idf_representations), " x ", len(all_tf_idf_representations[0]))

Size of featurized corpus:  10000  x  14908


In [13]:
# You will be implementing a sanity check to see if your tf-idf featurization is correct (0.2 points out of 0.6 points)
# Note, if your implementation is correct, the sum of the tf-idf vector of the first sentence in `patent_sentences` should be around 4.0

print("The first patent sentence is: ", patent_sentences[0])
print("The tf-idf vectorization of the first patent sentence is: \n", all_tf_idf_representations[0])
print("The words and their tf-idf scores in the sentence are provided below:")
sum_tf_idf_for_one_sentence = 0
# YOUR CODE GOES HERE
# STEP 1: Obtain a list of unique words in the first sentence in `patent_sentences`
# Note: Store the unique words in a variable called `current_sentence_words`

first_sentence = patent_sentences[0]
first_sentence_tokenized = nltk.tokenize.word_tokenize(patent_sentences[0])

stop = set(stopwords.words("english"))
punc = set(string.punctuation)
stop_punc = stop.union(punc)
s_no_stop = [word.lower() for word in first_sentence_tokenized if word.lower() not in stop_punc]

current_sentence_words= list(set(s_no_stop))

for var in range(len(current_sentence_words)):
    print(current_sentence_words[var], all_tf_idf_representations[0][vocab_words.index(current_sentence_words[var])])
    sum_tf_idf_for_one_sentence += all_tf_idf_representations[0][vocab_words.index(current_sentence_words[var])]
print("The sum of tf-idf for this sentence is: ", sum_tf_idf_for_one_sentence)

The first patent sentence is:  summary  this summary section is provided to introduce aspects of embodiments in a simplified form, with further explanation of the embodiments following in the detailed description.
The tf-idf vectorization of the first patent sentence is: 
 [0.5052881578292414, 0.3544175092804874, 0.24258641715365117, 0.4019158166800052, 0.28937750519726096, 0.3774660657948867, 0.43087761011630266, 0.1284414312164984, 0.4934110913558669, 0.28656023862216834, 0.32344351034656876, 0.2617973519640443, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 

# Part 5: Computing Term Document Incidence Matrix (1 point)
Recall that a term-document incidence matrix represents each document as a binary vector, where terms that are present are assigned the value 1 and absent terms are assigned 0. In principle, this is quite similar to term frequencies: instead of actual counts (or normalized counts) of terms, you have 1 or 0 depending on whether the terms are present or absent in the document.

In the following section, you will create a term document incidence matrix without using any external libraries.
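A toy version of the idea, using a made-up five-word vocabulary and two made-up tokenized documents, might look like the sketch below. The helper name `incidence_vector` is hypothetical.

```python
# Made-up vocabulary and tokenized documents for illustration
vocab = ["camera", "image", "method", "sensor", "vehicle"]
docs = [["method", "image", "image", "camera"],
        ["method", "vehicle", "sensor"]]

def incidence_vector(tokens, vocab):
    """1 if the vocabulary word occurs in the document, else 0 (counts are ignored)."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]

matrix = [incidence_vector(tokens, vocab) for tokens in docs]
for row in matrix:
    print(row)
```

Note that the repeated word "image" still contributes a single 1, so the sum of each row equals the number of unique words in that document.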

In [14]:
# Use the vocab_words vocabulary constructed earlier:
# The ordering of the terms in the matrix should be the same as the ordering in the vocab_words list

all_tdi_matrix = []
for var in range(len(patent_sentences)):
# for var in range(3): 
# Note: Doing this computation over all the sentences might take some time. Therefore, you may consider writing your code and debugging by iterating over 10 examples.
# However, before submitting the code, make sure you run the snippet for all the sentences to ensure complete correctness. 
    # YOUR CODE GOES HERE
    # 1. Initialize the current TDI vector with all zeros. The initialized vector should be of the same length as `vocab_words`
    curr_tdi = [0.0 for x in vocab_words]

    # 2. Tokenize the current sentence into words using NLTK, while removing stop words and punctuations
    curr_sentence = patent_sentences[var]
    curr_sentence_tokenized = nltk.tokenize.word_tokenize(curr_sentence)

    stop = set(stopwords.words("english"))
    punc = set(string.punctuation)
    stop_punc = stop.union(punc)
    s_no_stop = [word.lower() for word in curr_sentence_tokenized if word.lower() not in stop_punc]

    current_sentence_words= list(set(s_no_stop))

    # 3. Iterate over unique words in current sentence
    for word in current_sentence_words:
      # 4. Assign 1 for indices of these unique words
      curr_tdi[vocab_words.index(word)] = 1

    # 5. Append the current TDI vector to `all_tdi_matrix` to populate a new row in the matrix 
    all_tdi_matrix.append(curr_tdi)

print("Size of featurized corpus: ", len(all_tdi_matrix), " x ", len(all_tdi_matrix[0]))

Size of featurized corpus:  10000  x  14908


Now, as a sanity check, you will ensure that the number of 1's in a given sentence's row is equal to the number of unique words in that sentence.

In [15]:
sanity_check_indices = [22, 11, 42, 888]
for idx in sanity_check_indices:
    current_sentence = patent_sentences[idx]
    # YOUR CODE GOES HERE
    # 1. Tokenize the current sentence into words using NLTK, while removing stop words and punctuations
    curr_sentence = patent_sentences[idx]
    curr_sentence_tokenized = nltk.tokenize.word_tokenize(curr_sentence)

    stop = set(stopwords.words("english"))
    punc = set(string.punctuation)
    stop_punc = stop.union(punc)
    s_no_stop = [word.lower() for word in curr_sentence_tokenized if word.lower() not in stop_punc]

    current_sentence_words= list(set(s_no_stop))

    # 2. Count the number of unique words in the sentence
    num_unique_words = len(current_sentence_words)
    
    # If the number of unique words in the sentence is equal to the sum of the corresponding row in `all_tdi_matrix`, print "True"
    if num_unique_words == sum(all_tdi_matrix[idx]):
        print("True")


True
True
True
True


# Part 6: Search Using Text Representations (1 point)
In this part, you will write the following functions to enable searching through the featurized documents for a provided query:

1. `featurize_query()`:   
    This is to featurize a textual query using the specified method.  
    This part is worth 0.6 points.
2. `compute_cosine_similarity()`:  
    This is to compute the cosine similarity between the featurized query and all the 10,000 documents in the corpus.  
    This part is worth 0.4 points. 

With the use of these two functions, you will obtain the 10 most similar documents for the query "autonomous vehicle operator database".  
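As a reminder, the cosine similarity of two vectors is their dot product divided by the product of their Euclidean norms. The sketch below is a plain-Python illustration on made-up vectors, not the assignment solution; the function name `cosine_similarity` and the choice to return 0.0 for all-zero vectors are assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot_product = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot_product / (norm_u * norm_v)

# Made-up incidence vectors: one query and three documents
query = [1, 1, 0, 0]
documents = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]]

scores = [cosine_similarity(query, doc) for doc in documents]
# Rank document indices from most to least similar
ranking = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Documents sharing more weighted terms with the query score closer to 1, and documents sharing none score 0, which is what makes the score usable for ranking.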

In [16]:
# Codeblock to featurize a query
# The idea here is to convert the query, which is in a textual/string format, in the same feature space as that of our documents
# In previous sections, we featurized the documents using tf, tf-idf, and tdi. We are going to do the same for any given query

def featurize_query(query, technique):
    featurized_query = [0.0 for x in vocab_words] # this is the final vector that will be returned
    if technique == 'tf':
        # YOUR CODE GOES HERE
        # 1. Tokenize the query to only keep relevant tokens
        query_tokenized = nltk.tokenize.word_tokenize(query)

        stop = set(stopwords.words("english"))
        punc = set(string.punctuation)
        stop_punc = stop.union(punc)
        query_fin = [word.lower() for word in query_tokenized if word.lower() not in stop_punc]

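        # 2. Featurize it using vocab_words variable with tf values
        # Iterate over unique tokens, but compute tf on the full token list so repeated query words count
        for word in set(query_fin):
          # YOUR CODE GOES HERE
          # Skip query words that are missing from the vocabulary (`index` would raise ValueError)
          if word in vocab_words:
            featurized_query[vocab_words.index(word)] = corpus.tf(word, query_fin)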
        
    elif technique == 'tf-idf':
        # YOUR CODE GOES HERE
        # 1. Tokenize the query to only keep relevant tokens
        query_tokenized = nltk.tokenize.word_tokenize(query)

        stop = set(stopwords.words("english"))
        punc = set(string.punctuation)
        stop_punc = stop.union(punc)
        query_fin = [word.lower() for word in query_tokenized if word.lower() not in stop_punc]

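        # 2. Featurize it using `vocab_words` and `corpus` variable with tf-idf values
        # Iterate over unique tokens, but compute tf on the full token list so repeated query words count
        for word in set(query_fin):
          # YOUR CODE GOES HERE
          # Skip query words that are missing from the vocabulary (`index` would raise ValueError)
          if word in vocab_words:
            word_idf = corpus.idf(word)
            word_tf_idf = word_idf * corpus.tf(word, query_fin)
            featurized_query[vocab_words.index(word)] = word_tf_idf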

    elif technique == 'tdi':
        # YOUR CODE GOES HERE
        # 1. Tokenize the query to only keep relevant tokens
        query_tokenized = nltk.tokenize.word_tokenize(query)

        stop = set(stopwords.words("english"))
        punc = set(string.punctuation)
        stop_punc = stop.union(punc)
        query_fin = [word.lower() for word in query_tokenized if word.lower() not in stop_punc]

        query_fin = list(set(query_fin))

        # 2. Featurize it using `vocab_words` with incidence values

        ## Iterate over unique words in current sentence
        for word in query_fin:
          ## Assign 1 for indices of these unique words, skipping words outside the vocabulary
          if word in vocab_words:
            featurized_query[vocab_words.index(word)] = 1
        
    else:
        print("The specified featurization technique is not included in the implementation. Please specify one of the following featurization techniques: 'tf', 'tf-idf', 'tdi'")
        return
    return featurized_query

In [17]:
# Compute cosine similarity between the query vector and all the document vectors
# Return a list of floats with the number of elements equal to the number of documents in the corpus

def compute_cosine_similarity(query_vector, document_vectors):
    similarity_values = []
    for var in range(len(document_vectors)):
        # YOUR CODE GOES HERE
        curr_doc = document_vectors[var]

        # 1. Compute cosine similarity between the query vector and the current document vector
        # Note: You may want to use Numpy's `dot` and `linalg.norm` for calculating the cosine similarities
        denominator = norm(query_vector) * norm(curr_doc)
        # Guard against all-zero vectors (e.g., a sentence with only stop words), which would otherwise yield NaN
        cos_sim = dot(query_vector, curr_doc) / denominator if denominator > 0 else 0.0

        # 2. Append the computed similarity value to the `similarity_values` list
        similarity_values.append(cos_sim)
        
    return similarity_values

In [18]:
query = 'autonomous vehicle operator database'

# Compute query vector using each of the featurization techniques
query_vector_tf = featurize_query(query, 'tf')
query_vector_tfidf = featurize_query(query, 'tf-idf')
query_vector_tdi = featurize_query(query, 'tdi')

# Compute the similarity scores
similarity_scores_tf = compute_cosine_similarity(query_vector_tf, all_tf_representations)
similarity_scores_tfidf = compute_cosine_similarity(query_vector_tfidf, all_tf_idf_representations)
similarity_scores_tdi = compute_cosine_similarity(query_vector_tdi, all_tdi_matrix)

In [19]:
# Obtain the indices of top-10 most similar documents to the query
top_10_tf = sorted(range(len(similarity_scores_tf)), key=lambda i: similarity_scores_tf[i])[-10:]
top_10_tfidf = sorted(range(len(similarity_scores_tfidf)), key=lambda i: similarity_scores_tfidf[i])[-10:]
top_10_tdi = sorted(range(len(similarity_scores_tdi)), key=lambda i: similarity_scores_tdi[i])[-10:]

print("Top 10 most similar sentences using tf (with sim_value): ")
count = 0
for idx in top_10_tf[::-1]:
    count += 1
    print(count, " | (", similarity_scores_tf[idx], ") | ", patent_sentences[idx])

# SANITY CHECK
print("\n\nSimilarity scores for tf should be in the range 0.38 - 0.5: ", similarity_scores_tf[top_10_tf[-1]] < 0.5 and similarity_scores_tf[top_10_tf[0]] > 0.38)

print("\n******************\n")

print("Top 10 most similar sentences using tf-idf (with sim_value): ")
count = 0
for idx in top_10_tfidf[::-1]:
    count += 1
    print(count, " | (", similarity_scores_tfidf[idx], ") | ", patent_sentences[idx])

# SANITY CHECK
print("\n\nSimilarity scores for tf-idf should be in the range 0.28 - 0.4: ", similarity_scores_tfidf[top_10_tfidf[-1]] < 0.4 and similarity_scores_tfidf[top_10_tfidf[0]] > 0.28)

print("\n******************\n")

print("Top 10 most similar sentences using tdi (with sim_value): ")
count = 0
for idx in top_10_tdi[::-1]:
    count += 1
    print(count, " | (", similarity_scores_tdi[idx], ") | ", patent_sentences[idx])

# SANITY CHECK
print("\n\nSimilarity scores for tdi should be in the range 0.3 - 0.38: ", similarity_scores_tdi[top_10_tdi[-1]] < 0.38 and similarity_scores_tdi[top_10_tdi[0]] > 0.3)

Top 10 most similar sentences using tf (with sim_value): 
1  | ( 0.4931969619160719 ) |  meanwhile, a user has to drive by himself/herself a vehicle that cannot normally perform autonomous driving, for example, a vehicle in which an error has occurred in an autonomous driving-related sensor and or a manually driven vehicle which does not support an autonomous driving function.
2  | ( 0.42640143271122083 ) |  it is an object of the invention to facilitate control of an autonomous vehicle by using one or more variable pitch cameras positioned at different locations and/or orientations of the autonomous vehicle.
3  | ( 0.4225771273642583 ) |  this means that, when an exterior monitoring camera on an autonomous vehicle fails during autonomous driving, there is a possibility that the vehicle's safe driving control and autonomous driving control will be affected.
4  | ( 0.4166666666666667 ) |  other systems, for example autopilot systems, may be used only when the system has been engaged, wh

# Part 6.1: Qualitative Analysis of Search Results (0.4 points)

Compare the retrieved and ranked results using the three featurization techniques by qualitatively inspecting the top-10 sentences and their scores. Which featurization technique do you think is the most effective, and which is the worst? You are encouraged to state which retrieved sentences are great responses for the query "autonomous vehicle operator database," and which are particularly bad responses.

# Answer 6.1

Given the query "autonomous vehicle operator database", I think the results given for tf-idf are the most informative -- specifically, the first result for tf-idf "in one aspect, the method may include acquiring vehicle-operator data from a database..." and the third result "other systems, for example autopilot systems, may be used only when..." are responses that I would find particularly helpful if I were doing research on this topic.

The search results for tdi were the least informative. They appear generally shorter in length, and some of the results are not even full sentences. For instance, the first result for tdi was " the autonomous vehicle may call for assistance when such a situation is identified" which is not informative at all.



# Part 6.2: Evaluating Search Results using NDCG (0.8 points)

For this part, you will quantify the NDCG scores for search using each of these featurization techniques. We will make some assumptions to quantify NDCG for the single query presented. 

Assumption 1: We are supposed to rank 10 documents, the indices of which are given below. This subset of documents is stored in the variable `patent_sentences_subsample`.

Assumption 2: The ground-truth relevance of these documents is given in the list `groundtruth_relevance`. Ground-truth relevance captures how relevant each of the documents is to the given query, based on the inputs of end-users/annotators. An ideal search engine should rank items in the same order as their ground-truth relevance.

You will obtain the similarity score of the query with these 10 documents and compute the NDCG metric to quantify the search quality of retrieved results. To compute NDCG, please look into sklearn's `ndcg_score` here: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ndcg_score.html
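
To make the metric concrete, here is a minimal pure-Python sketch of NDCG for a single query, assuming linear gain (the relevance value itself) and a log2 rank discount. The function names `dcg` and `ndcg` and the toy values are illustrative assumptions. The sketch does not treat tied scores specially, so it may differ from `ndcg_score` when scores are tied.

```python
import math

def dcg(relevances_in_ranked_order):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances_in_ranked_order, start=1)
    )

def ndcg(true_relevance, predicted_scores):
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    # Rank documents by predicted score, highest first
    order = sorted(
        range(len(predicted_scores)),
        key=lambda i: predicted_scores[i],
        reverse=True,
    )
    ranked_relevance = [true_relevance[i] for i in order]
    # The ideal ranking sorts documents by their true relevance
    ideal_dcg = dcg(sorted(true_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal_dcg if ideal_dcg > 0 else 0.0

toy_relevance = [3, 2, 1]
perfect_ranking_score = ndcg(toy_relevance, [0.9, 0.5, 0.1])   # same order as relevance
reversed_ranking_score = ndcg(toy_relevance, [0.1, 0.5, 0.9])  # worst possible order
print(perfect_ranking_score, reversed_ranking_score)
```

A ranking that matches the ground-truth order scores 1.0, and any other ordering scores lower, which is why a higher NDCG indicates that the similarity scores order the documents more like the annotators did.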

In [20]:
# Create patent_sentences_subsample below
patent_sentences_subsample = []
indices_of_interest = [2211, 352, 4254, 6632, 1609, 7207, 4101, 5999, 7733, 5383]
for idx in indices_of_interest:
    patent_sentences_subsample.append(patent_sentences[idx])
# The corresponding tf, tf-idf, and tdi vectors for these documents are stored in the variables below
tf_features_subsample = [all_tf_representations[x] for x in indices_of_interest]
tfidf_features_subsample = [all_tf_idf_representations[x] for x in indices_of_interest]
tdi_features_subsample = [all_tdi_matrix[x] for x in indices_of_interest]

groundtruth_relevance = [0.9, 0.3, 0.6, 0.5, 0.4, 0.45, 0.2, 0.2, 0.4, 0.4] # these are the groundtruth relevance scores provided to you. 

# Function to compute the NDCG score
def compute_ndcg(similarity_scores, groundtruth_relevance):
    # YOUR CODE GOES HERE
    similarity_scores = [similarity_scores]
    groundtruth_relevance = [groundtruth_relevance]
    current_ndcg_score = ndcg_score(y_true=np.asarray(groundtruth_relevance), y_score=np.asarray(similarity_scores))
    
    return current_ndcg_score

# Compute the similarity vectors for each of the techniques below and print the NDCG values 
# YOUR CODE GOES HERE
# STEP 1: Compute NDCG for tf
tf_simscores = [similarity_scores_tf[i] for i in indices_of_interest]
ndcg_tf = compute_ndcg(tf_simscores, groundtruth_relevance)

# STEP 2: Compute NDCG for tf-idf
tfidf_simscores = [similarity_scores_tfidf[i] for i in indices_of_interest]
ndcg_tfidf = compute_ndcg(tfidf_simscores, groundtruth_relevance)

# STEP 3: Compute NDCG for tdi
tdi_simscores = [similarity_scores_tdi[i] for i in indices_of_interest]
ndcg_tdi = compute_ndcg(tdi_simscores, groundtruth_relevance)

print("NDCG using tf: ", ndcg_tf) 
print("NDCG using tf-idf: ", ndcg_tfidf)
print("NDCG using tdi: ", ndcg_tdi)

NDCG using tf:  0.8387169347976292
NDCG using tf-idf:  0.966127125519166
NDCG using tdi:  0.8982851716822671


In [21]:
# SANITY CHECK; DO NOT MODIFY THIS CODE BLOCK
print("The NDCG for tf should be between 0.82 and 0.84:", ndcg_tf > 0.82 and ndcg_tf < 0.84)
print("The NDCG for tf-idf should be between 0.95 and 0.97:", ndcg_tfidf > 0.95 and ndcg_tfidf < 0.97)
print("The NDCG for tdi should be between 0.88 and 0.91:", ndcg_tdi > 0.88 and ndcg_tdi < 0.91)



The NDCG for tf should be between 0.82 and 0.84: True
The NDCG for tf-idf should be between 0.95 and 0.97: True
The NDCG for tdi should be between 0.88 and 0.91: True


# Question 6.3: Different form of groundtruth (0.3 points)

(A) Imagine you are part of a team building the "next Google" and are asked to choose between MRR and NDCG as the primary evaluation metric for your new search engine. Given that you still want to build an interface where the users can browse through multiple retrieved documents for an input query, which metric would you advocate for? Explain your reasoning. (0.15 points)

(B) A search engine has an MRR of 0.3333. Is it reasonable to say that, on average, the search engine retrieves the most relevant result at rank 3? Explain your answer. (0.15 points)

# Answer 6.3:
(A) I would advocate for NDCG because MRR only focuses on the first relevant item in the search, while users should ideally be able to browse through multiple search results, hopefully ranked in order of usefulness. NDCG measures the overall quality of a set of search results and takes ranking into account, which makes it a better choice for the "next Google."

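(B) No. MRR is the mean of the reciprocal ranks of the first relevant result for each query, not the reciprocal of the mean rank. An MRR of 1/3 is consistent with the first relevant result always appearing at rank 3, but also with very different behavior: for example, two queries whose first relevant results appear at ranks 2 and 6 give (1/2 + 1/6) / 2 = 1/3, even though the mean rank is 4. MRR also tracks the first relevant result, which is not necessarily the most relevant one.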
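
The relationship between MRR and rank can be checked with a short sketch. The function name `mean_reciprocal_rank` and the example rank lists are assumptions chosen for illustration; ranks are 1-based, and `None` stands for a query where no relevant result was retrieved (contributing a reciprocal rank of 0).

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Average of 1/rank over queries; a rank of None contributes 0."""
    reciprocal_ranks = [
        0.0 if rank is None else 1.0 / rank for rank in first_relevant_ranks
    ]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Three different systems, described by the rank of the first relevant
# result for each query they were evaluated on
always_rank_three = [3, 3, 3]
ranks_two_and_six = [2, 6]        # mean rank is 4, not 3
hit_once_miss_twice = [1, None, None]

mrr_a = mean_reciprocal_rank(always_rank_three)
mrr_b = mean_reciprocal_rank(ranks_two_and_six)
mrr_c = mean_reciprocal_rank(hit_once_miss_twice)
print(mrr_a, mrr_b, mrr_c)
```

All three rank lists give an MRR of about 0.3333, so the metric value alone does not determine the typical rank of the first relevant result.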