In [1]:
%%capture
import pandas as pd
import nltk
import os
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet, stopwords
from nltk.tokenize import word_tokenize
from tqdm.notebook import tqdm
from sklearn import feature_extraction
import numpy as np
import dotenv
lemmatizer = WordNetLemmatizer()
dotenv.load_dotenv('../ext_variables.env') # Necessary to avoid putting absolute paths
os.chdir(os.getenv("PATH_FILES_ADM"))
tqdm.pandas() # add tqdm progress_apply method for Pandas classes (DataFrame, Series and GroupBy classes)
nltk.download('wordnet') # wordnet
nltk.download("stopwords") # stopword list
nltk.download('punkt') # sentence separation
nltk.download('averaged_perceptron_tagger') # pos-tagging
nltk.download('omw-1.4') # wordnet

In [3]:
# First of all, let's load the csv/the tsv
places  = pd.read_csv("places.tsv", sep = "\t", index_col=0)
places.drop_duplicates(inplace=True)

## 2. Search Engine
<a id = "point_2"></a>

First of all, let's perform some pre-processing with lemmatization (incorporating POS tagging), stopwords removal (what remains of them at that point) and lowercase conversion on the _description_ columns.

The first function in the next block of code converts tags from the [Penn Treebank project](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) into those used by [WordNet](https://wordnet.princeton.edu).

The second function performs the whole pre-processing itself:
 - First of all, it tokenizes the strings with `word_tokenize`. This may seem simple, but it is not. Under the hood, `nltk` uses a pre-trained (for English) unsupervised model to split the string into sentences (one can retrain the model on a particular corpus, for a specific target language). Each sentence is then split into words with regular expressions.
 - The tokenized output is then passed to the part-of-speech tagger, whose details can be found in a very good blog post, [here](https://explosion.ai/blog/part-of-speech-pos-tagger-in-python).
 - The tags are mapped to the WordNet ones with the function `get_wordnet_pos`, the first defined in the block. Notice that the function returns `None` for tags not related to adjectives, verbs, nouns or adverbs; the tokens carrying such tags are then discarded.
 - With the tuples `(token, pos_tag)` we can finally call the WordNet lemmatizer (more specifically, the _morphy_ function, more info [here](https://wordnet.princeton.edu/documentation/morphy7wn)).
 - The output is converted to lower case and checked against stopwords (and against some punctuation which may still be there).
 - The words are joined together with the `|` separator, a character that I do not expect to be popular in the English language.

Notice that with this process we have implicitly removed punctuation.
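
As a small illustration of the tag-mapping and filtering steps, here is a sketch that needs no `nltk` download. The `(token, tag)` pairs are written by hand (they stand in for the output of `nltk.pos_tag`), and the names `PENN_TO_WORDNET` and `to_wordnet_pos` are hypothetical; the single-letter codes are the values behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`.

```python
# Hand-written (token, Penn Treebank tag) pairs standing in for nltk.pos_tag output.
tagged = [("The", "DT"), ("oldest", "JJS"), ("trees", "NNS"), ("were", "VBD"),
          ("quickly", "RB"), ("cut", "VBN"), ("by", "IN"), ("them", "PRP"), (".", ".")]

# First letter of the Penn tag -> WordNet POS code (adjective, verb, noun, adverb).
PENN_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def to_wordnet_pos(treebank_tag):
    """Return the WordNet POS code, or None for any other tag."""
    return PENN_TO_WORDNET.get(treebank_tag[:1])


# Keep only the tokens whose tag maps to one of the four WordNet classes.
kept = [(token, to_wordnet_pos(tag)) for token, tag in tagged
        if to_wordnet_pos(tag) is not None]
print(kept)
# [('oldest', 'a'), ('trees', 'n'), ('were', 'v'), ('quickly', 'r'), ('cut', 'v')]
```

The determiner, the preposition, the pronoun and the full stop are dropped by the filter alone. A verb such as "were" survives this step and is only removed later, by the stopword check.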

In [3]:
# For the following mapping, credits to this stack overflow question: https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None


# The following is a function which performs the whole pre-processing with lemmatization (incorporating information from POS tagging), stopwords removal and lowercase conversion
def preprocessing_field(description: str) -> str:
    # First of all, we need to tokenize the text.
    # We use word_tokenize from nltk
    tokenized = word_tokenize(description, language='english')
    # Then we use the pos tagger from nltk
    pos_tagged = nltk.pos_tag(tokenized)
    # Then we convert tags from Penn Treebank format to the one of WordNet (punctuation is filtered later, together with the stopwords)
    converted_tags_words = [(x[0], get_wordnet_pos(x[1])) for x in pos_tagged]
    # Filter everything that is not an adverb, a verb, a noun or an adjective
    converted_tags_words = [(x[0], x[1]) for x in converted_tags_words if x[1] is not None]
    # Lemmatize the words (WordNet lemmatizer, morphy function)
    lemmatized_words = [lemmatizer.lemmatize(word = x[0], pos = x[1]) for x in converted_tags_words]
    # Lowercase conversion and stopwords removal (most of the stopwords were anyway removed with get_wordnet_pos and related filtering)
    # Some residual punctuation is also removed. Notice that stopwords removal could be done also later on with CountVectorizer and scikit, but we do it here
    # The stopword list is loaded once per call into a set, instead of being re-read for every token
    english_stopwords = set(stopwords.words("english"))
    lemmatized_words = [x.lower() for x in lemmatized_words if (x.lower() not in english_stopwords) and (x not in ("’", "‘", "\'", "‟", "”", "-","“","."))]
    # Join the words into strings with words separated explicitly (afterwards we will only need to specify the separator for scikit learn CountVectorizer)
    return "|"+"|".join(lemmatized_words)+"|"

The following is just a little example to show the power of what we are doing right now. The text, a little homage to Leo Breiman, is taken from his Wikipedia page.

Also notice that, unlike stemming, lemmatization returns tokens which can still be read and understood; this in general improves visualization, reporting and debugging with text data. Additionally, lemmatization preserves far more variability than stemming (we are reducing words to their dictionary lemma, not to their root), and this may be useful since we are working on a search engine, whose queries may aim at specific words.

However, we may not always want this variability, and this is indeed the case for some specific applications, such as ML modelling where we prefer to catch the signal while disregarding noise due to specific lemmas.

In [4]:
preprocessing_field("Leo Breiman was a distinguished statistician at the University of California, Berkeley. He was the recipient of numerous honors and awards, and was a member of the United States National Academy of Sciences. Breiman's work helped to bridge the gap between statistics and computer science, particularly in the field of machine learning. His most important contributions were his work on classification and regression trees and ensembles of trees fit to bootstrap samples. Bootstrap aggregation was given the name bagging by Breiman. Another of Breiman's ensemble approaches is the random forest.")

'|leo|breiman|distinguished|statistician|university|california|berkeley|recipient|numerous|honor|award|member|united|states|national|academy|sciences|breiman|work|help|bridge|gap|statistic|computer|science|particularly|field|machine|learning|important|contribution|work|classification|regression|tree|ensemble|tree|fit|bootstrap|sample|bootstrap|aggregation|give|name|bagging|breiman|breiman|ensemble|approach|random|forest|'

Now we apply what we have described to both _description_ columns. We are using the integration between `pandas` and `tqdm` in order to keep track of progress. Also notice that we are not overwriting the original columns (the search engine should return a polished output, not the columns pre-processed for the search engine).

In [5]:
# Lemmatization of Description and Short Description with POS tagging
# We simply assign the description columns to the transformed descriptions
places["placeDesc_post"], places["placeShortDesc_post"] = places.placeDesc.progress_apply(preprocessing_field), places.placeShortDesc.progress_apply(preprocessing_field)

  0%|          | 0/7361 [00:00<?, ?it/s]

  0%|          | 0/7361 [00:00<?, ?it/s]

## 2.1 Conjunctive query
<a id = "point_2.1"></a>

The next step is building the inverted index. To do that, we first build a so-called Term-Document matrix, which is easily done with scikit-learn, more specifically with its (very, very useful) `CountVectorizer` class. The `transform` method of the class returns a Document-Term matrix, so we need to take its transpose. After transposing, every row corresponds to a term and every column to a document; each entry is a binary indicator (presence or absence of the term in the document).

We need to pass a regular expression to enforce the separator we introduced before: `|`. We also pass other arguments to the constructor, but they are relatively straightforward.
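
To see what the token pattern extracts, it can be tried directly with the standard `re` module. This is only a sketch on a hand-written string in the `|word|word|` format produced by `preprocessing_field`; the names `TOKEN_PATTERN` and `sample` are made up for the example. To my understanding, scikit-learn applies the pattern in the same way (with `findall`), returning the content of the capturing group.

```python
import re

# Same pattern passed to the vectorizer: one capturing group, with a lookbehind
# and a lookahead so that the "|" separators are not consumed by the match.
TOKEN_PATTERN = r"(?<=\|)(.*?)(?=\|)"

sample = "|leo|breiman|random|forest|"
tokens = re.findall(TOKEN_PATTERN, sample)
print(tokens)  # ['leo', 'breiman', 'random', 'forest']
```

Since the separators are never consumed, the same `|` can close one token and open the next one, which is why a single separator between words is enough.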


In [6]:
# The RegEx token pattern is very simple here, it just has a single capturing group with a non-greedy quantifier and a lookahead and a lookbehind to avoid consuming "|" in the match, which is a NO-GO
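# binary=True => one-hot (presence/absence) representation
# strip_accents=None disables accent stripping (recent scikit-learn versions reject False for this parameter)
one_hot_vectorizer = feature_extraction.text.CountVectorizer(strip_accents=None, lowercase=False, token_pattern=r"(?<=\|)(.*?)(?=\|)", binary = True)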
one_hot_vectorizer.fit(places.placeDesc_post) # Get the vocabulary from the corpus
term_document = one_hot_vectorizer.transform(places.placeDesc_post).transpose() # Transform the corpus into the document-term matrix, then take the transpose of the output

We then get the dictionary mapping each word to its index in the matrix. In order to do that, we can just access an attribute of the one_hot_vectorizer object. It will be saved, serialized, in storage. For that we can just use the `pickle` Python module.

In [7]:
import pickle
# The vocabulary is saved, serialized, in storage
vocabulary_word_index = one_hot_vectorizer.vocabulary_
with open("vocabulary_word_index.pickle", "wb") as vocab_file:
    pickle.dump(vocabulary_word_index, file = vocab_file)

In order to get the dictionary/the inverted index we just need to extract the indexes of the non-zero entries in the sparse Term-Document matrix. The `transform` method of the `CountVectorizer` class returns a _scipy_ `sparse.csr_matrix`, and its transpose is a `sparse.csc_matrix`; both classes have a `nonzero` method for this, so we can just use that one. What is returned, as usual, are two NumPy arrays, one for the indexes referring to the rows (the terms), and one for the indexes referring to the columns (the documents). They are two iterables, so we `zip` them together and iterate over them in order to build the inverted index.

Again, the resulting dictionary is saved in storage as a serialized object with the `pickle` module.
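
The construction can be sketched on a toy example in pure Python. Here a nested list stands in for the SciPy sparse matrix, and the list of `(row, column)` pairs plays the role of the output of `nonzero`; all the names (`term_document_toy`, `nonzero_pairs`, `inverted_index_toy`) are hypothetical.

```python
from collections import defaultdict

# Toy term-document matrix: rows are terms, columns are documents.
term_document_toy = [
    [1, 0, 1],  # term 0 appears in documents 0 and 2
    [0, 1, 1],  # term 1 appears in documents 1 and 2
    [0, 0, 0],  # term 2 appears in no document
]

# (row, column) pairs of the non-zero entries, i.e. what zipping the two
# arrays returned by nonzero() would give for a sparse matrix.
nonzero_pairs = [(r, c)
                 for r, row in enumerate(term_document_toy)
                 for c, value in enumerate(row) if value]

# Each term id is mapped to the list of documents containing it.
inverted_index_toy = defaultdict(list)
for term_id, doc_id in nonzero_pairs:
    inverted_index_toy[term_id].append(doc_id)

print(dict(inverted_index_toy))  # {0: [0, 2], 1: [1, 2]}
```

A term that never appears (term 2 here) simply has no key in the dictionary.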

In [8]:
from collections import defaultdict
inverted_index_onehot = defaultdict(list)
for term_id, doc_id in zip(*term_document.nonzero()):  # nonzero() is called only once; it returns the row (term) and column (document) indexes of the non-zero entries
    inverted_index_onehot[term_id].append(doc_id)
with open("inverted_index_onehot.pickle", "wb") as inverted_index_file:
    pickle.dump(inverted_index_onehot, file = inverted_index_file)

In order to load them into memory from storage we simply need to use `pickle` again, this time with the `load` function.

In [9]:
import pickle

with open("vocabulary_word_index.pickle", "rb") as vocab_file:
    vocabulary_word_index = pickle.load(vocab_file)
with open("inverted_index_onehot.pickle", "rb") as inverted_index_file:
    inverted_index_onehot = pickle.load(inverted_index_file)

At this point, we just need to define the search engine function. It is enough to take the intersection between the sets/the lists (the values in the dictionary) containing the document ids for each of the terms in order to get the output of the search engine from the original DataFrame.
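
The intersection logic can be sketched on a toy index, without `input` and without a DataFrame. The vocabulary, the index and the function name `conjunctive_lookup` are all made up for the example.

```python
# Hypothetical toy vocabulary (term -> id) and inverted index (id -> document ids).
toy_vocabulary = {"museum": 0, "cat": 1, "comic": 2}
toy_index = {0: [0, 2, 3], 1: [2], 2: [1, 3]}


def conjunctive_lookup(query, vocabulary, inverted_index):
    """Return the ids of the documents containing every term of the query."""
    token_ids = [vocabulary.get(token) for token in set(query.lower().split())]
    # A missing term (None) means that no document can match. The test is
    # `is None` because 0 is a perfectly valid term id.
    if not token_ids or any(token_id is None for token_id in token_ids):
        return set()
    return set.intersection(*(set(inverted_index[token_id]) for token_id in token_ids))


print(conjunctive_lookup("museum cat", toy_vocabulary, toy_index))     # {2}
print(conjunctive_lookup("museum comic", toy_vocabulary, toy_index))   # {3}
print(conjunctive_lookup("cat comic", toy_vocabulary, toy_index))      # set()
print(conjunctive_lookup("museum dragon", toy_vocabulary, toy_index))  # set()
```

An empty intersection is a legitimate final result: once two terms share no document, adding further terms cannot bring documents back.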

In [10]:
import re
def search_engine1(vocabulary: dict[str, int], inverted_index: dict[int, list[int]], dataframe: pd.DataFrame) -> pd.DataFrame:
    """
    This function asks for a query and retrieves the rows from the input DataFrame whose pre-processed placeDesc entry contains all the words in the query. It does this by completely relying on an inverted index data structure, which in turn relies on a vocabulary dictionary in order to be mapped back to the original string representations of the words/tokens.
    :param vocabulary: The vocabulary dictionary mapping the string representation for each token to the related integer index
    :param inverted_index: The inverted index mapping each index to a list of documents ids
    :param dataframe: The Pandas DataFrame from which to retrieve the places. Notice that the function assumes that the Dataframe has [placeName, placeURL and placeDesc] in its column index
    :return: a Pandas DataFrame with all the documents containing all the words in the query
    """
    query = input("Query: ") # Ask for input
    query_elements = re.split(r'\s+', query.strip().lower()) # Strip leading/trailing whitespace, then split on runs of whitespace
    query_elements = list(set(query_elements)) # Get only unique words
    token_ids = list(map(vocabulary.get, query_elements)) # Get the ids for each of the tokens from the vocabulary. None is returned if the requested token is not present
    # Conjunctive query: if a term is missing from the vocabulary, no document can match, and an empty DataFrame is returned.
    # The check is `is None` and not truthiness, because 0 is a valid token id.
    if any(token_id is None for token_id in token_ids):
        output_docs = set()
    else:
        # Intersect the sets of document ids of all the query terms. An empty intersection stays empty
        output_docs = set.intersection(*(set(inverted_index[token_id]) for token_id in token_ids))
    output = dataframe.iloc[list(output_docs)][["placeName", "placeDesc", "placeURL"]].copy() # Copy, so that the input DataFrame is left untouched
    output.placeURL = output.placeURL.str.replace(r"https://www.atlasobscura.com", "", regex = False) # Remove redundant part of the URL
    return output

In [26]:
search_engine1(vocabulary_word_index, inverted_index_onehot, places)

Unnamed: 0,placeName,placeDesc,placeURL
7169,Metropolitan Pit Stop,Metropolitan Pit Stop was founded by Jimmy Val...,/places/metropolitan-pit-stop
4610,Cabot's Pueblo Museum,There’s a fascinating museum high up in the hi...,/places/cabots-pueblo-museum-2
2055,Old Slave Mart,"Built in 1859, the Old Slave Mart was actually...",/places/old-slave-mart
519,KattenKabinet,The death of a pet can inspire a number of rea...,/places/kattenkabinet
4617,Cowgirl Hall of Fame,"When we think of cowgirls, ivory halls don’t i...",/places/cowgirl-hall-fame
...,...,...,...
5111,Museum of Un-Natural History,"Throughout the ’70s and ’80s, Gerald Matthews ...",/places/museum-of-unnatural-history-walla-walla
1528,Dr Pepper Museum,The oldest major manufacturer of soft drink co...,/places/dr-pepper-museum
3578,Batalion Comic Book Museum and Club,Walking into Prague’s Batalion Comic Book Muse...,/places/batalion-comic-book-museum-and-club
3580,Geppi's Entertainment Museum,It’s a unique place that can create a sentimen...,/places/geppi-s-entertainment-museum


## 2.2 Conjunctive query & ranking score
<a id = "point_2.2"></a>

The first step here is similar if not identical to the one for [Point 2.1](#point_2.1). We use again scikit-learn, but this time with the TF-IDF Vectorizer. We already have the dictionary, so we pass it to the constructor method of the class. Notice that the computed sparse document term matrix has its row vectors already L2-normalized. This makes our lives easier when computing the cosine similarity later on.
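
To see why the L2 normalization helps, here is a small numeric check in pure Python. It assumes that the query is represented as a binary vector over its unique terms (this is how the ranking below treats it); the weights and the names `doc_unit`, `cosine_full` and `cosine_shortcut` are invented for the example.

```python
import math

# Hypothetical raw weights of one document; its norm is sqrt(9 + 16 + 144) = 13.
doc_raw = {"nuclear": 3.0, "test": 4.0, "site": 12.0}
doc_norm = math.sqrt(sum(w ** 2 for w in doc_raw.values()))
doc_unit = {term: w / doc_norm for term, w in doc_raw.items()}  # L2-normalized row

query_terms = ["nuclear", "test"]  # unique query terms, each with weight 1

# Full cosine similarity: dot product divided by the product of the two norms.
dot = sum(doc_unit.get(term, 0.0) for term in query_terms)
query_norm = math.sqrt(len(query_terms))
unit_norm = math.sqrt(sum(w ** 2 for w in doc_unit.values()))
cosine_full = dot / (query_norm * unit_norm)

# Shortcut: the document norm is 1, so only the query norm is left.
cosine_shortcut = sum(doc_unit[term] for term in query_terms) / math.sqrt(len(query_terms))

print(math.isclose(cosine_full, cosine_shortcut))  # True
```

In other words, with normalized rows the score of a document is just the sum of its tf-idf values for the query terms, divided by the square root of the number of query terms.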

In [12]:
# The RegEx token pattern is very simple here, it just has a single capturing group with a non-greedy quantifier and a lookahead and a lookbehind to avoid consuming "|" in the match, which is a NO-GO
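# No binary flag here: we want the tf-idf weights, not the one-hot representation. The vocabulary built before is reused, so the term indexes stay the same
# strip_accents=None disables accent stripping (recent scikit-learn versions reject False for this parameter)
tfidf_vectorizer = feature_extraction.text.TfidfVectorizer(strip_accents=None, lowercase=False, token_pattern=r"(?<=\|)(.*?)(?=\|)", vocabulary = vocabulary_word_index)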
tfidf_vectorizer.fit(places.placeDesc_post)
term_document_tfidf = tfidf_vectorizer.transform(places.placeDesc_post).transpose() # Transform the corpus into the document-term matrix (with tf-idf weights), then take the transpose of the output

Now we have to build the new inverted index with the tf-idf information. The way we build it is very similar to what we have done in [Point 2.1](#point_2.1). The main difference here is that we need to incorporate the tf-idf information, with the value for each key in the dictionary becoming a list of `(document id, tf-idf value)` tuples.

First of all, we need to access the non-zero elements of the sparse matrix returned by the `transform` method of the `TfidfVectorizer` class. We can just use some fancy indexing on the sparse matrix object to do that.

In [13]:
# %%prun # Profile the code to check that fancy indexing is working (documentation for SciPy is really not up to the standards)
# I wanted to be sure that it was NOT iterating over the indexes in order to retrieve the elements from the sparse matrix. In other words, I wanted to check the vectorization of the retrieval op
from collections import defaultdict
inverted_index_tfidf = defaultdict(list)

nonzero_entries = term_document_tfidf.nonzero() # Get indexes (for rows and for columns) of the non-zero entries in the sparse matrix

flattened_sparse = np.asarray(term_document_tfidf[nonzero_entries[0], nonzero_entries[1]]).flatten() # Flatten the matrix to a vector of its non-zero entries (simple fancy indexing op)
for row in zip(nonzero_entries[0], nonzero_entries[1], flattened_sparse):  # Build iterator that returns row and column index + value of non-zero entries at each iteration
    inverted_index_tfidf[row[0]].append((row[1], row[2])) # Appending tuples to the list assigned to each term/token/key
with open("inverted_index_tfidf.pickle", "wb") as inverted_index_file:
    pickle.dump(inverted_index_tfidf, file = inverted_index_file)

In [14]:
inverted_index_tfidf

defaultdict(list,
            {0: [(31, 0.12809278430664678),
              (46, 0.07183677948082569),
              (193, 0.12809278430664678),
              (208, 0.07183677948082569),
              (360, 0.1012435219693227),
              (425, 0.032667399512587285),
              (522, 0.06263887269430637),
              (526, 0.060519753357584014),
              (561, 0.07678476325611504),
              (633, 0.06932317751401049),
              (670, 0.0643687593825565),
              (693, 0.04901102799914176),
              (828, 0.039782136054536725),
              (833, 0.05138631162979764),
              (967, 0.04327478427654896),
              (1051, 0.16722534176662762),
              (1090, 0.03610754381845898),
              (1197, 0.044710157508771727),
              (1238, 0.08802908140867095),
              (1264, 0.08022817102273508),
              (1527, 0.0958622198621064),
              (2181, 0.07027864086511042),
              (2298, 0.05216273583475883),
      

In [15]:
assert len(flattened_sparse) == len(term_document_tfidf.nonzero()[0])

In [24]:
import heapq
import re
def search_engine2(vocabulary: dict[str, int], inverted_index: dict[int, list[tuple[int, float]]], dataframe: pd.DataFrame) -> pd.DataFrame:
    """
    This function asks for a query and retrieves the 5 top rows from the input DataFrame whose pre-processed placeDesc entry contains all the words in the query. The 5 top rows are defined according to the cosine similarity of the placeDesc entry/document with the query. The pipeline completely relies on an inverted index data structure, which in turn relies on a vocabulary dictionary in order to be mapped back to the original string representations of the words/tokens.
    :param vocabulary: The vocabulary dictionary mapping the string representation for each token to the related integer index
    :param inverted_index: The inverted index mapping each index to a list of tuples, each containing a documents id and a related tf-idf value
    :param dataframe: The Pandas DataFrame from which to retrieve the places. Notice that the function assumes that the Dataframe has [placeName, placeURL and placeDesc] in its column index
    :return: a Pandas DataFrame with the top 5 documents according to cosine similarity with the query
    """
    query = input("Query: ") # Ask for input
    query_elements = re.split(r'\s+', query.strip().lower()) # Strip leading/trailing whitespace, then split on runs of whitespace
    query_elements = list(set(query_elements)) # Get only unique words
    token_ids = list(map(vocabulary.get, query_elements)) # Map each of the tokens to their ids/values in the dictionary. None is returned if not present

    # Conjunctive query: if a term is missing from the vocabulary, no document can match, and an empty DataFrame is returned.
    # The check is `is None` and not truthiness, because 0 is a valid token id.
    if any(token_id is None for token_id in token_ids):
        output_docs = set()
    else:
        # Intersect the sets of document ids found in the lists of the query terms. An empty intersection stays empty
        output_docs = set.intersection(*(set(doc_id for doc_id, _ in inverted_index[token_id]) for token_id in token_ids))


    # Now we need to sort according to TF-IDF and cosine similarity with the query
    output_docs = list(output_docs) # We need an ordered structure, set has no order
    output_array = np.empty(shape=(len(token_ids), len(output_docs))) # Allocate array which will contain the tfidf values for each of the query token for each of the output documents

    for iter_index, token_id in enumerate(token_ids):
        tf_idf_dict = dict(inverted_index.get(token_id, [])) # Use the index passed as argument (not the global one); .get avoids adding new keys to a defaultdict
        output_array[iter_index] = list(map(tf_idf_dict.get, output_docs)) # each row contains the tf_idf values (one for each output doc) for a specific token


    # We are doing a lot in the following row. First of all we are computing the cosine similarity between each of the docs and the query: the rows of the tf-idf matrix
    # are L2-normalized and the query is treated as a binary vector over its unique terms, so the similarity is the sum of the tf-idf values of the query terms divided by the
    # square root of the number of query terms. Then we zip the (negated) similarities with the document ids and move the resulting tuples into a list
    heap_output = list(zip(-output_array.sum(0)/np.sqrt(output_array.shape[0]), output_docs))
    # Rearrange the elements of the list (in place) so that it represents a heap. heapq implements a min heap; since we have changed the sign of the scores, it behaves as a max heap on the cosine similarity
    heapq.heapify(heap_output)


    index_df = [] # Initialize list containing indexes of the documents/DataFrame rows
    cosine_sim = [] # Initialize list containing cosine similarity
    for i in range(min(5, len(heap_output))):
        doc = heapq.heappop(heap_output) # Repeated heappop after heapify is exactly equivalent to heapsort
        index_df.append(doc[1])
        cosine_sim.append(-doc[0])

    output_df = dataframe.iloc[list(index_df)][["placeName", "placeDesc", "placeURL"]].copy() # Copy, so that the new column is not added to a view of the input DataFrame
    output_df["cosine_sim"] = cosine_sim
    output_df.placeURL = output_df.placeURL.str.replace(r"https://www.atlasobscura.com", "", regex = False) # Remove redundant part of the URL

    return output_df
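
The top-k selection used in the function above can be tried in isolation. The following is a sketch on invented `(cosine similarity, document id)` pairs; it also shows `heapq.nlargest` as a possible shorter alternative (not what the function does), which gives the same result here.

```python
import heapq

# Hypothetical (cosine similarity, document id) pairs.
scores = [(0.25, 11), (0.31, 7), (0.29, 3), (0.10, 42)]

# heapq only provides a min heap, so the scores are negated to pop the largest first.
heap = [(-score, doc_id) for score, doc_id in scores]
heapq.heapify(heap)

top2 = []
for _ in range(min(2, len(heap))):
    neg_score, doc_id = heapq.heappop(heap)
    top2.append((doc_id, -neg_score))

# Alternative: let heapq select the k largest tuples directly.
top2_alt = [(doc_id, score) for score, doc_id in heapq.nlargest(2, scores)]

print(top2)      # [(7, 0.31), (3, 0.29)]
print(top2_alt)  # [(7, 0.31), (3, 0.29)]
```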

In [27]:
search_engine2(vocabulary_word_index, inverted_index_tfidf, places)

Unnamed: 0,placeName,placeDesc,placeURL,cosine_sim
7070,Plum Brook Station (Neil A. Armstrong Test Fac...,The Neil A. Armstrong Test Facility (formerly ...,/places/plum-brook-station,0.311871
1111,Atomic Survival Town,In 1955 a series of 14 nuclear test explosions...,/places/atomic-survival-town,0.291292
2847,Fliegeberg,Berlin’s Fliegeberg (literally “Fly Mountain”)...,/places/fliegeberg,0.263563
6319,Rulison Nuclear Test Site,"On September 10, 1969, the United States Atomi...",/places/rulison-nuclear-test-site,0.254651
2327,Aerojet Dade Rocket Facility,The government and its contractors have a long...,/places/aerojet-dade-rocket-facility,0.253628
