# Calculating similarity measures between queries and sample documents  

Objectives are to demonestrate: 
- How to preprocess text and embedd textual data
-  Compare the results of textual similarity between tradditional and dep learning based methods   


*** Important consideration: You are not expected to use any particular library or any particular method; the codes below are just meant to provide you with some help so you spend most of your time on the deep learning based model. Feel free to choose your own methods. the evaluation is based on being able to obtain results regardless of which method is being used.  

# Set-up and import data 

from google.colab import files
uploaded = files.upload()

In [1]:
import json 

import warnings
warnings.filterwarnings("ignore")

with open('data/sample_repository.json') as in_file:
    test_data = json.load(in_file)

titles = [item[0] for item in test_data['data']]
documents = [item[1] for item in test_data['data']]

In [2]:
import pandas as pd
df = pd.DataFrame(list(zip(titles, documents)), columns =['titles', 'documents'])

In [3]:
# Query for all the models/approaches for semantic similarity
query = ['fruits', 'vegetables', 'healthy foods in Canada']

# TF-IDF - No Pre-Processing

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

import nltk
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

In [5]:
# Print Top n (10) results
def print_top(scores, df1, query, n=10):
#     print("\033[1m"+query.upper()+"\033[0m")
    print_list = []
    for i in scores.argsort()[-n:][::-1]:
#         print(df1.iloc[i,0])
        print_list.append(df1.iloc[i,0])
#     print("\n")
    return print_list

In [6]:
def tfidf_matching(df1, col, query, n):
    # Vectorize using TFIDF vectorizer
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(df1[col].tolist())

    # Calculate the word frequency, and a measure of similarity (whatever you find it to be approperiate) of the search terms with each document
    from sklearn.feature_extraction.text import CountVectorizer
    cv = CountVectorizer()
    cv_fit = cv.fit_transform([query] + df1[col].tolist())
    
    # Get the word_list (Corpus)
    word_list = cv.get_feature_names()
    
    # Get the word Count_list
    count_list = cv_fit.toarray().sum(axis=0)

    # Create a dictionary of word_list and count_list
    wordcount_dict = dict(zip(word_list,count_list))

    # Vectorize the QUERY
    query_vec = vectorizer.transform([query])
    
    # Use linear_kernel to caculate the similarity
    results = linear_kernel(vectors,query_vec).reshape((-1,))
    
    # Output the similarity scores for top 5/10 documents and interpreat the findings and compare the results 
    results_list = print_top(results, df1, query, n)
    
    return results_list

In [7]:
# Semantic Similarity for all Queries
# Create a dictionary to store the results of the qury
results_dictionary = {}
for item in query:
    result = tfidf_matching(df, 'documents', item, n = 5)
    results_dictionary[item] = result

# print (results_dictionary)

df_results_tfidf_raw = pd.DataFrame.from_dict(results_dictionary)

# TF-IDF Repeat the same task after some preprocessing 

You are not expected to do any specific type of cleaning/standardizations but at minimum use 2 techniques (e.g. lemmatization, removing punctuations and etc.) 

In [8]:
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer

nltk.download('words',quiet=True)
nltk.download('wordnet',quiet=True)

# Function for Pre Processing the text
def pre_processing(text):
    # Remove Punctuation
    import re
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    
    # Tokenize Words
    words = word_tokenize(text.lower())
    
    # Remove Stop Words
    words = [w for w in words if w not in stopwords.words("english")]
    
    # Using Lemmetizer to get the simplest form of the word
    #words = [PorterStemmer().stem(w) for w in words]
    words = [WordNetLemmatizer().lemmatize(w, pos='v') for w in words]
    
    sentence = ""
    for word in words:
        sentence1 = word + " "
        sentence = sentence + sentence1
    
    return sentence

In [9]:
# Pre-processing the text and creating a new column 'processed'
df['processed'] = df.documents.apply(pre_processing)

In [10]:
# Semantic Similarity for all Queries
results_dictionary = {}
for item in query:
    result = tfidf_matching(df, 'processed', item, n = 5)
    results_dictionary[item] = result

# print (results_dictionary)

df_results_tfidf_processed = pd.DataFrame.from_dict(results_dictionary)

# Semantic matching using GloVe embeddings

In [11]:
#!pip install tfidf

In [12]:
#!pip install  gensim==4.0.1 # if you decide to use the gensim library and the sample codes below, you would need gensim version >=4.0.1 to be installed 
import gensim
print(gensim.__version__)

4.0.1


In [13]:
import logging
import json
import logging
from re import sub
from multiprocessing import cpu_count

import numpy as np

import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import SoftCosineSimilarity

In [14]:
import logging
# Initialize logging.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.WARNING)

In [15]:
import nltk
# Import and download stopwords from NLTK.
nltk.download('stopwords', quiet = True)  # Download stopwords list.
stopwords = set(nltk.corpus.stopwords.words("english"))

In [16]:
def preprocess(doc):
    # Tokenize, clean up input document string
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    # you may decide to add additional steps here 
    return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

In [17]:
# Download and load the GloVe word vector embeddings
if 'glove' not in locals():  # only load if not already in memory
    glove = api.load("glove-wiki-gigaword-50")

similarity_index = WordEmbeddingSimilarityIndex(glove)

In [18]:
def glove_matching(similarity_index, df1, col, query_s, n):
    corpus = [preprocess(document) for document in df1[col]]
    
    query = preprocess(query_s)
    
    # Build the term dictionary, TF-idf model
    # Keep in mind that the search query must be in the dictionary as well, in case the terms do not overlap with the documents  
    dictionary = Dictionary(corpus+[query])
    tfidf = TfidfModel(dictionary=dictionary)
    
    # Create the term similarity matrix. 
    # The nonzero_limit enforces sparsity by limiting the number of non-zero terms in each column. 
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)  # , nonzero_limit=None)
    
    # Compute similarity measure between the query and the documents.
    query_tf = tfidf[dictionary.doc2bow(query)]

    index = SoftCosineSimilarity(
                tfidf[[dictionary.doc2bow(document) for document in corpus]],
                similarity_matrix)

    doc_similarity_scores = index[query_tf]

    # Output the similarity scores for top 5/10 documents and interpreat the findings and compare the results 
    results_list = print_top(doc_similarity_scores, df1, query_s, n)

    return results_list

In [19]:
# Semantic Similarity for all Queries
results_dictionary = {}
for item in query:
    result = glove_matching(similarity_index, df, 'documents', item, 5)
    results_dictionary[item] = result

# print (results_dictionary)

df_results_tfidf_glove = pd.DataFrame.from_dict(results_dictionary)

100%|█████████████████████████████████████████| 568/568 [00:05<00:00, 96.91it/s]
100%|█████████████████████████████████████████| 568/568 [00:05<00:00, 97.83it/s]
100%|█████████████████████████████████████████| 568/568 [00:05<00:00, 97.97it/s]


# BERT
Use a bert model to create sentence embeddings and calculate the similarity between queries and documents.

In [20]:
from sentence_transformers import SentenceTransformer, util
import torch

#using model all-MiniLM-L12-v2 which is an All-round model tuned for many use-cases. 
# it is trained on a large and diverse dataset of over 1 billion training pairs.
embedder = SentenceTransformer('all-MiniLM-L12-v2', )

In [21]:
# Function for Semantic Matching using Bert
def bert_semantic_matching(embedder, df1, col, query, n):
    # Corpus of documents
    corpus = df1[col]+[query]
    
    # Corpus Empeddings
    corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
    
    # Queries Embeddings
    queries_embeddings = embedder.encode(query, convert_to_tensor=True)
    
    # Compute cosine-similarities for each sentence with each other sentence
    cosine_scores = util.pytorch_cos_sim(corpus_embeddings, queries_embeddings)
    
    # Reshape to numpy array
    cosine_scores = cosine_scores.numpy().reshape(32)
    
    # Output the similarity scores for top 5/10 documents and interpreat the findings and compare the results 
    results_list = print_top(cosine_scores, df1, query, n)
    
    return results_list

In [22]:
# Semantic Similarity for all Queries
results_dictionary = {}
for item in query:
    result = bert_semantic_matching(embedder, df, 'documents', item, 5)
    results_dictionary[item] = result

# print (results_dictionary)

df_results_tfidf_bert = pd.DataFrame.from_dict(results_dictionary)

# Compare the findings

In [23]:
def create_Dataframe(col,df_results_tfidf_raw,df_results_tfidf_processed,df_results_tfidf_glove,df_results_tfidf_bert):
    df= pd.DataFrame()
    df['raw'] = df_results_tfidf_raw[col]
    df['processed'] = df_results_tfidf_processed[col]
    df['glove'] = df_results_tfidf_glove[col]
    df['bert'] = df_results_tfidf_bert[col]
    return df

In [99]:
def print_documents(df1):
    list_results = set(df1.raw.to_list() + df1.processed.to_list()  + df1.glove.to_list() + df1.bert.to_list())
    list_results = list(list_results)
    title_list = df.titles[df.titles.isin(list_results)].to_list()
    documents_list = df.documents[df.titles.isin(list_results)].to_list()
    for i in range (0, len(list_results)):
        print ("\033[1m" + title_list[i] + "\033[0m" )
        print (documents_list[i])

## FRUITS

In [100]:
df_fruits = create_Dataframe(query[0], df_results_tfidf_raw,df_results_tfidf_processed,df_results_tfidf_glove,df_results_tfidf_bert)
print_documents(df_fruits)
df_fruits

[1mPomegranate Bhagwa[0m
Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegranate variety from India. The Deep Red arils & the pleasing Red but rugged skin enhances the appearance whilst promoting shelf life of the fruit. Bhagwa is widely known for its soft seed, Dark red color and extremely delicious. Packaging: Net weight of box 2.5kg, 3.00kg, 3.5kg. Details: Minimum Weight 180gm, maximum weight 400gm Color of arils: Dark Cherry red. Taste: Sweet Fruit count / carton (3.50 kg net wt.) 9 Numbers packed per carton: 350-400gms
[1mPomegranate Arakta[0m
Fresh Pomegranate Arakta from Anushka Avni International This Pomegranate are bigger in size, sweet with soft seeds, bold red arils. It also possess glossy, attractive, dark red skin. Packaging: Net weight of box 2.5kg, 3.00kg, 3.5kg. Details: Minimum Weight 180gm, maximum weight 400gm Taste: Sweet Fruit count / carton (3.50 kg net wt.) 9 Numbers packed per carton: 350-400gms 10 Numbers packed per carton :290-3

Unnamed: 0,raw,processed,glove,bert
0,Food classes,fruit serving bowl,Food classes,Food classes
1,Canada's Food Guide,Neuro linguistic programming,fruit serving bowl,List of fruit dishes
2,fruit serving bowl,Pomegranate Arakta,List of fruit dishes,Tomatoes
3,Neuro linguistic programming,About Us,Pomegranate Bhagwa,Pomegranate Bhagwa
4,Pomegranate Arakta,Contact Us,Canada's Food Guide,Grapes Flame / Red Seedless


### Comparision
for the amount of documents and the corpus available, using raw is better than using processed corpus. This can be seen by selection of NLP, About Us and Contact Us titles in the data model which was developed using processed text. This is most likely due to the loss of context.

Both Golve and Bert based models were able to keep provide better semantic matching that simple TF-IDF models. This is because of the developed word and sentence embeddings used a large data set (corpus) for training). Glove used word embeddings while bert used sentence embeddings. 

Reading though the documents, in my opinion, bert sentence embeddings model was able to better semantics matching than glove model. This is clear from the document titled tomatoes and grapes which do not show up anywhere else.

## VEGETABLES

In [103]:
df_vegetables = create_Dataframe(query[1], df_results_tfidf_raw,df_results_tfidf_processed,df_results_tfidf_glove,df_results_tfidf_bert)
print_documents(df_vegetables)
df_vegetables

[1mPomegranate Bhagwa[0m
Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegranate variety from India. The Deep Red arils & the pleasing Red but rugged skin enhances the appearance whilst promoting shelf life of the fruit. Bhagwa is widely known for its soft seed, Dark red color and extremely delicious. Packaging: Net weight of box 2.5kg, 3.00kg, 3.5kg. Details: Minimum Weight 180gm, maximum weight 400gm Color of arils: Dark Cherry red. Taste: Sweet Fruit count / carton (3.50 kg net wt.) 9 Numbers packed per carton: 350-400gms
[1mPomegranate Arakta[0m
Fresh Pomegranate Arakta from Anushka Avni International This Pomegranate are bigger in size, sweet with soft seeds, bold red arils. It also possess glossy, attractive, dark red skin. Packaging: Net weight of box 2.5kg, 3.00kg, 3.5kg. Details: Minimum Weight 180gm, maximum weight 400gm Taste: Sweet Fruit count / carton (3.50 kg net wt.) 9 Numbers packed per carton: 350-400gms 10 Numbers packed per carton :290-3

Unnamed: 0,raw,processed,glove,bert
0,Canada's Food Guide,Canada's Food Guide,Food classes,Small Onions
1,fruit serving bowl,fruit serving bowl,Canada's Food Guide,List of fruit dishes
2,Neuro linguistic programming,Neuro linguistic programming,List of fruit dishes,Tomatoes
3,Pomegranate Arakta,Pomegranate Arakta,Small Onions,Food classes
4,About Us,About Us,Pomegranate Bhagwa,Grapes Flame / Red Seedless


### Comparision
For this query, there was no difference in the top 5 results for the TF-IDF model. This is most likely because of the limited corups and documents.
Bert and glove provided better results. Its hard to quantify the accuracy of these two models, but the results are more semantically matching that the TF-IDF models.

## HEALTHY FOODS IN CANADA

In [102]:
df_health = create_Dataframe(query[2], df_results_tfidf_raw,df_results_tfidf_processed,df_results_tfidf_glove,df_results_tfidf_bert)
print_documents(df_health)
df_health

[1mPomegranate Bhagwa[0m
Fresh Pomegranate from Anushka Avni International Bhagwa is a premium Pomegranate variety from India. The Deep Red arils & the pleasing Red but rugged skin enhances the appearance whilst promoting shelf life of the fruit. Bhagwa is widely known for its soft seed, Dark red color and extremely delicious. Packaging: Net weight of box 2.5kg, 3.00kg, 3.5kg. Details: Minimum Weight 180gm, maximum weight 400gm Color of arils: Dark Cherry red. Taste: Sweet Fruit count / carton (3.50 kg net wt.) 9 Numbers packed per carton: 350-400gms
[1mAbout Us[0m
About Us Anushka Avni International (AAI) takes pleasure in presenting itself as one of the renowned Suppliers and Exporter. We have huge assortment of agro products available with us. We feel proud when buyers come to us recognizing the standard quality which we offer in the world wide market. We follow the best practices while supplying… Read More..
[1mDiet[0m
In nutrition, the diet of an organism is the sum of foods

Unnamed: 0,raw,processed,glove,bert
0,Canada's Food Guide,Canada's Food Guide,Canada's Food Guide,Canada's Food Guide
1,Diet,Diet,Diet,Canadian Industry Statistics
2,Canadian Industry Statistics,Canadian Industry Statistics,fruit serving bowl,List of fruit dishes
3,Ford Bronco,Major Market,About Us,Major Market
4,Major Market,Ford Bronco,Pomegranate Bhagwa,Grapes Flame / Red Seedless


### Comparison
Continuing the comparison of the last query, BERT model clearly has better semantic matching within the available documents. 