# Word Embeddings for Information Retrieval and Query Expansion


*Goals of this tutorial:* In this tutorial, the information retrieval engine is improved in two ways: (i) directly matching the query and the document in the latent semantic space of word embeddings; (ii) expanding the original query via word embeddings.

<br>*Reference:* Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, "Introduction to Information Retrieval", Chapter 9: Relevance feedback and query expansion.

## Part 0. Dataset and Parsing (The same as Modeling_and_Ranking_Texts part)

The dataset is collected from Quizlet (https://quizlet.com), a website where users can generate their own flashcards. Each flashcard is made up of an entity on the front and, on the back, a definition describing or explaining that entity. We treat the entities on the front of the flashcards as the queries and the definitions on the back as the documents. A definition (document) is relevant to an entity (query) if it comes from the back of that entity's flashcard; otherwise it is not relevant. **In this homework, the terms query and entity are used interchangeably, as are document and definition.**

The format of the dataset is like this:

**query \t document id \t document**

Examples:

decision tree	\t 27946 \t	show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers).

where "decision tree" is the entity in the front of a flashcard and "show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers)." is the definition on the flashcard's back and "27946" is the id of the definition. Naturally, this document is relevant to the query.

false positive rate	\t 686	\t fall-out; probability of a false alarm

where document 686 is not relevant to query "decision tree" because the entity of "fall-out; probability of a false alarm" is "false positive rate".
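
The format above can be sketched in a few lines of code. This is only an illustration under the assumption that the three fields are separated by tab characters exactly as shown; the names `parse_line`, `repo`, and `is_relevant` are hypothetical and not part of the homework code.

```python
def parse_line(line):
    """Split one 'query \\t document id \\t document' line into its three fields."""
    query, doc_id, document = line.rstrip("\n").split("\t", 2)
    # Strip the padding spaces that may surround the tab separators.
    return query.strip(), doc_id.strip(), document.strip()


sample_lines = [
    "decision tree\t27946\tshow complex processes with multiple decision rules.\n",
    "false positive rate\t686\tfall-out; probability of a false alarm\n",
]

# Map each document id to its (entity, definition) pair.
repo = {}
for raw in sample_lines:
    entity, doc_id, definition = parse_line(raw)
    repo[doc_id] = (entity, definition)


def is_relevant(query, doc_id):
    """A document is relevant only if it came from the query's own flashcard."""
    return repo[doc_id][0] == query


print(is_relevant("decision tree", "27946"))
print(is_relevant("decision tree", "686"))
```

The relevance judgment needs nothing beyond the entity stored with each definition, which is why the evaluation code later in this notebook compares the query string with the stored entity.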

For parsing this dataset, you could also just copy your code from homework 1 to complete the following tasks:
* Tokenize documents (definitions) using **whitespace and punctuation as delimiters**.
* Remove stop words: use nltk stop words list (from nltk.corpus import stopwords)
* Stemming: use [nltk Porter stemmer](http://www.nltk.org/api/nltk.stem.html#module-nltk.stem.porter)
* Remove any other strings that you think are less informative or noisy.
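
As a rough illustration of the first two steps, the sketch below tokenizes on whitespace and punctuation and drops stop words using only the standard library. The tiny `STOP_WORDS` set and the `tokenize` helper are assumptions made for this example; the homework itself uses the nltk stop word list, and Porter stemming is left out here.

```python
import re

# Tiny illustrative stop-word set (an assumption); the homework uses nltk's English list.
STOP_WORDS = {"a", "an", "and", "as", "of", "the", "with", "is", "to"}


def tokenize(text):
    """Lowercase, split on anything that is not a letter or digit, drop stop words."""
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    # re.split can produce empty strings at the edges, so filter them out too.
    return [t for t in tokens if t and t not in STOP_WORDS]


print(tokenize("Display decision logic (if statements) as set of (nodes) questions."))
```

Splitting on a negated character class keeps the delimiter definition in one place, which makes it easier to see exactly which characters survive.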

In [0]:
#!pip install nltk
# Build the document repository: definition id -> [entity, definition]
from collections import OrderedDict
with open('homework_1_data.txt', encoding='utf8') as f:
    lines = f.readlines()
# OrderedDict keeps the definitions in file order; no sorting is done here
documentRepo = OrderedDict()
for line in lines:
    s = line.split('\t')
    entity = s[0]
    def_id = s[1]
    definition = s[2].replace('\n','')
    documentRepo[def_id] = [entity, definition]

# Part 1: Word2Vec

In this part the Word2Vec algorithm will be used to generate word embeddings for tokens in the dataset. You can just use a package like https://radimrehurek.com/gensim/models/word2vec.html. Let's set the size of word embeddings to be 20. Please print the word embeddings for the tokens: 
* relational
* database
* garbage
* collection
* retrieval 
* model

In [0]:
# code here.
# how do you generate the word embeddings
from gensim.test.utils import common_texts, get_tmpfile
import gensim 
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
nltk.download('stopwords')
porter = PorterStemmer()

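STOP_WORDS = set(stopwords.words("english"))  # build once instead of once per token

def read_input(input_file):
    """Yield one list of preprocessed tokens per definition (third tab-separated field)."""
    with open(input_file, encoding='utf8') as f:
        for line in f:
            line = re.sub(r'[:=*?`@%+!&";#{}\[\]\'^|\~]','', line.split('\t')[2].replace('\n',''))
            # NOTE: inside [...], `,-\\` is a range from ',' to '\', so this class also
            # matches digits and uppercase letters, not only the punctuation listed.
            # Escape the hyphen if only punctuation is intended; the outputs shown in
            # this notebook were produced with the range as written.
            line = re.sub(r'[,-\\—()\.<>]', ' ', line) # replace separating punctuation with a space to be split later
            line = re.sub(r'[^A-Za-z0-9❖•]', ' ', line)
            line = re.sub(r'[❖•]', ' ', line)
            line = ' '.join([porter.stem(word) for word in line.split() if word not in STOP_WORDS])
            yield gensim.utils.simple_preprocess(line)

# read the tokenized definitions into a list
# each definition becomes a series of tokens,
# so this becomes a list of lists
documents = list(read_input('homework_1_data.txt'))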

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [0]:
# Build a stemmed copy of the repository: definition id -> [entity, stemmed definition]
import re
from collections import OrderedDict
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
nltk.download('stopwords')
porter = PorterStemmer()
stop_words = set(stopwords.words("english"))  # build once instead of once per token
with open('homework_1_data.txt', encoding='utf8') as f:
    lines = f.readlines()
stemDocumentRepo = OrderedDict()
for line in lines:
    s = line.split('\t')
    entity = s[0]
    def_id = s[1]
    line = re.sub(r'[:=*?`@%+!&";#{}\[\]\'^|\~]','', s[2].replace('\n',''))
    # NOTE: `,-\\` inside the class is a character range (',' to '\') that also
    # covers digits and uppercase letters.
    line = re.sub(r'[,-\\—()\.<>]', ' ', line) # replace separating punctuation with a space to be split later
    line = re.sub(r'[^A-Za-z0-9❖•]', ' ', line)
    line = re.sub(r'[❖•]', ' ', line)
    definition = ' '.join([porter.stem(word) for word in line.split() if word not in stop_words])
    stemDocumentRepo[def_id] = [entity, definition]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
len(documents)

30917

In [0]:
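# `size` is the gensim 3.x name for the embedding dimension (renamed `vector_size` in gensim 4).
# Passing `documents` to the constructor already builds the vocabulary and trains once;
# the explicit train() call below then continues training for 30 more epochs.
model = gensim.models.Word2Vec(documents, size=20, window=10, min_count=2, workers=10)
model.train(documents, total_examples=len(documents), epochs=30)
model.save("word2vec_hw4.model")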



In [0]:
from gensim.models import Word2Vec
model = Word2Vec.load("word2vec_hw4.model")
#https://github.com/kavgan/nlp-in-practice/blob/master/word2vec/Word2Vec.ipynb
queries = "relational database garbage collection retrieval model"
for query in queries.split():
    print ("Original Query: " + str(query))
    print ("Parsed Query: " + str(porter.stem(query)))
    print (model.wv[porter.stem(query)])

Original Query: relational
Parsed Query: relat
[ 2.4325955  -1.5862104   0.9993653  -2.3630714  -3.6648803  -1.400936
 -1.5606936  -0.12180549  2.5645795  -1.4470593   0.6036795  -1.8059071
 -0.59320927  0.56289136  5.02749     4.4252834   2.4621477  -2.636864
  4.0297747   0.17841956]
Original Query: database
Parsed Query: databas
[ 3.2639737  -0.40196422  1.2192895  -0.67061245  0.30400693  2.2527425
 -3.38287    -1.1784745   0.04066178 -4.403574    2.075228   -0.26188228
 -2.9757793  -0.04731051  5.0337152   3.7241054   1.0659866  -4.2606587
  1.4258126  -3.2073882 ]
Original Query: garbage
Parsed Query: garbag
[ 0.56764954  0.4797237   0.3522163   0.14468074  1.0309417  -0.21139057
 -1.0061702  -1.1167006  -1.0311371  -0.58094466  0.27873987  1.4679729
  0.08956432  2.0702474  -0.6433034  -0.29034895  2.1073434   1.1227099
  1.588469    0.16232611]
Original Query: collection
Parsed Query: collect
[ 3.2466333   2.0754826   0.8773468  -1.1259906   1.869884    3.6180139
 -2.4340394  -



# Part 2: Vector Space Model via Word Embeddings

In this part, the job is to match the query and the document by the cosine similarity between their embeddings.
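
For reference, cosine similarity can be sketched in plain Python as follows. The helper name `cosine_similarity` is hypothetical, and returning 0.0 for a zero vector is an assumption of this sketch; the notebook's own code assigns -1 to an all-zero document vector so that such documents rank last.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat a zero vector as "no similarity"
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```

Cosine similarity depends only on direction, not length, so rescaling either vector by a positive constant leaves the score unchanged.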

Since a query or a document usually contains more than one token, the first challenge is how to aggregate many word embeddings into a single embedding for the query or the document. There are many ways to do so:
* Max pooling: return the maximum value along each dimension of a bunch of word embeddings. For example, [1, 3, 4], [2, 1, 5] -> [2, 3, 5].
* Min pooling: return the minimum value along each dimension of a bunch of word embeddings
* Mean pooling: return the mean value along each dimension of a bunch of word embeddings
* Sum: element-wise add a bunch of word embeddings together
* Weighted sum: assign weights to word embeddings and then add them together. Weights could be TF, IDF or TF-IDF.
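
The aggregation methods listed above can be sketched on plain Python lists as below. The function name `pool` and the example weights are illustrative assumptions; in practice the weights would come from TF, IDF, or TF-IDF.

```python
def pool(vectors, method, weights=None):
    """Aggregate equal-length word vectors into a single vector."""
    columns = list(zip(*vectors))  # one tuple per embedding dimension
    if method == "max":
        return [max(col) for col in columns]
    if method == "min":
        return [min(col) for col in columns]
    if method == "sum":
        return [sum(col) for col in columns]
    if method == "mean":
        return [sum(col) / len(col) for col in columns]
    if method == "weighted":
        # weights[i] scales the i-th word vector before summing
        return [sum(w * x for w, x in zip(weights, col)) for col in columns]
    raise ValueError("unknown pooling method: " + method)


vectors = [[1, 3, 4], [2, 1, 5]]
print(pool(vectors, "max"))                       # [2, 3, 5]
print(pool(vectors, "min"))                       # [1, 1, 4]
print(pool(vectors, "mean"))                      # [1.5, 2.0, 4.5]
print(pool(vectors, "sum"))                       # [3, 4, 9]
print(pool(vectors, "weighted", weights=[2, 1]))  # [4, 7, 13]
```

Mean and sum differ only by a constant factor, so they yield the same cosine similarity whenever both vectors are pooled the same way.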

In [0]:
import numpy as np
import sys
import math

def vector_pooling(querys, definition, model, method):
    # Returns (query_vec, doc_vec): one pooled 20-dimensional vector each for the
    # query and the (already stemmed) definition. An unknown method yields empty lists.
    query_vec = []
    doc_vec = []
    if method == "max":
        query_vec = [-sys.maxsize] * 20
        doc_vec = [-sys.maxsize] * 20
        for query in querys.split():
            query_vec = np.maximum(np.array(query_vec), model.wv[porter.stem(query)])
        for word in definition.split():
            try:
                doc_vec   = np.maximum(doc_vec, model.wv[word])
            except KeyError:
                continue

    elif method == "min":
        query_vec = [sys.maxsize] * 20
        doc_vec = [sys.maxsize] * 20
        for query in querys.split():
            query_vec = np.minimum(np.array(query_vec), model.wv[porter.stem(query)])
        for word in definition.split():
            try:
                doc_vec   = np.minimum(doc_vec, model.wv[word])
            except KeyError:
                continue

    elif method == "mean":
        query_list = [0] * 20
        count_q = len(querys.split())
        for query in querys.split():
            query_list = np.add(query_list, model.wv[porter.stem(query)])
        query_vec = [x/count_q for x in query_list]

        doc_list = [0] * 20
        count_d = len(definition.split())
        for word in definition.split():
            try:
                doc_list = np.add(doc_list, model.wv[word])
            except KeyError:
                continue
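        # divide by the document length, not the query length (guarding empty documents)
        doc_vec = [x/count_d for x in doc_list] if count_d else doc_list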

    elif method == "sum":
        query_vec = [0] * 20
        for query in querys.split():
            query_vec = np.add(query_vec, model.wv[porter.stem(query)])

        doc_vec = [0] * 20
        for word in definition.split():
            try:
                doc_vec = np.add(doc_vec, model.wv[word])
            except KeyError:
                continue

    elif method == "weight":
        query_list = [0] * 20
        term = []
        count_q = len(querys.split())
        for query in querys.split():
            term.append(porter.stem(query))
            query_list = np.add(query_list, model.wv[porter.stem(query)])
        query_vec = [x/count_q for x in query_list]

        count_d = len(definition.split())
        tf = 1
        doc_list = [0] * 20
        for word in definition.split():
            try:
                if word in term:
                    if definition.count(word) > 0:
                        tf = 1 + math.log10(definition.count(word))
                    doc_list = np.add(doc_list, tf * model.wv[word])
                else:
                    doc_list = np.add(doc_list, 0.05 * model.wv[word])
            except KeyError:
                continue
        if count_d == 0:
            doc_vec = [0] * 20
        else:
            doc_vec = [x/count_d for x in doc_list]

    return query_vec, doc_vec

def vector_space_model_with_embeddings (querys, definition, model, method):
    # return the cosine similarity of query_vec and doc_vec
    query_vec, doc_vec = vector_pooling(querys, definition, model, method)
    dot_product = sum(i[0] * i[1] for i in zip(query_vec, doc_vec))
    if np.linalg.norm(doc_vec) == 0:
        cosine = -1
    else:
        cosine = dot_product / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec))
    return cosine

def rank_querys_docs (stemDocumentRepo, querys, model):
    maxScores = {}
    minScores = {}
    meanScores = {}
    sumScores = {}
    weightScores = {}
    for i in range (len(stemDocumentRepo)):
        _, stemdefinition = stemDocumentRepo[str(i)]
        # Max method
        maxScores[i]  = vector_space_model_with_embeddings(querys, stemdefinition, model, "max")
        # Min method
        minScores[i]  = vector_space_model_with_embeddings(querys, stemdefinition, model, "min")
        # Mean method
        meanScores[i] = vector_space_model_with_embeddings(querys, stemdefinition, model, "mean")
        # Sum method
        sumScores[i]  = vector_space_model_with_embeddings(querys, stemdefinition, model, "sum")
        # TF weighting method
        weightScores[i] = vector_space_model_with_embeddings(querys, stemdefinition, model, "weight")
    return maxScores, minScores, meanScores, sumScores, weightScores

def calculate_precision10 (docScores, query, type):
    printTopN = 10
    count = 0
    #print ("#######" + str(type) + "#######")
    for docID, score in sorted(docScores.items(), key=lambda item: item[1], reverse = True):
        #print (str(documentRepo[str(docID)][0]) + " "+ str(score))
        if printTopN > 0:
            if query == documentRepo[str(docID)][0]:
                count = count + 1
            printTopN-=1
        else:
            break
    return count/10

In [0]:
# your code here
for query in ["relational database", "garbage collection", "retrieval model"]:
    maxScores, minScores, meanScores, sumScores, weightScores = rank_querys_docs (stemDocumentRepo, query, model)
    print ("Query: " + str(query))
    print ("Precision@10: ")
    print ("Max pooling: " + str(calculate_precision10(maxScores, query, "max")))
    print ("Min pooling: " + str(calculate_precision10(minScores, query, "min")))
    print ("Mean pooling: " + str(calculate_precision10(meanScores, query, "mean")))
    print ("Sum pooling: " + str(calculate_precision10(sumScores, query, "sum")))
    print ("Weighted sum (TF): " + str(calculate_precision10(weightScores, query, "weight")))

Query: relational database
Precision@10: 
Max pooling: 0.4
Min pooling: 0.6
Mean pooling: 0.4
Sum pooling: 0.4
Weighted sum (TF): 0.5
Query: garbage collection
Precision@10: 
Max pooling: 0.0
Min pooling: 0.0
Mean pooling: 0.0
Sum pooling: 0.0
Weighted sum (TF): 0.5
Query: retrieval model
Precision@10: 
Max pooling: 0.0
Min pooling: 0.0
Mean pooling: 0.0
Sum pooling: 0.0
Weighted sum (TF): 0.0


Try different aggregation methods and report the precision@10 for these queries:
* query: relational database
* query: garbage collection
* query: retrieval model
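
As a reminder of what is being reported, precision@10 is the fraction of the ten highest-ranked documents that are relevant. The sketch below uses a hypothetical helper `precision_at_k` and an invented ranked list of entities; it is not the notebook's evaluation code.

```python
def precision_at_k(ranked_entities, query, k=10):
    """Fraction of the top-k results whose entity equals the query."""
    top_k = ranked_entities[:k]
    hits = sum(1 for entity in top_k if entity == query)
    return hits / k


# Invented ranking: entity of each retrieved definition, best score first.
ranked = ["relational database"] * 4 + ["sql"] * 6 + ["relational database"] * 3
print(precision_at_k(ranked, "relational database"))  # 0.4
```

Only the first ten positions count, so relevant documents ranked eleventh or lower do not change the score.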

### Discussion
Among these aggregation methods, which one is the best and which one is the worst?
<br>**[ANS]**
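<br>Based on the precision@10 results, the best method overall is **Weighted sum (TF)**: it is the only one that retrieves relevant documents for "garbage collection" (0.5), although min pooling scores higher on "relational database" (0.6 vs. 0.5). The worst methods are **max pooling, mean pooling, and sum**, which tie at 0.4, 0.0, and 0.0 on the three queries.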

# Part 3: Query Expansion via Word Embeddings

Remember the hardest query "retrieval model" in homework 1? Because there is no document containing "retrieval model" in the dataset, you cannot retrieve any documents by Boolean matching. Now, it is the time of your "revenge" via query expansion.

In this part, your job is to expand the original query like "retrieval model" by adding semantically similar words (e.g., "search"), which are selected from all tokens in the dataset.

There are many ways to do so. For this part, we want you to calculate the cosine similarity between each of the original query tokens and the other tokens based on their word embeddings.

First, please find the top 3 similar tokens for:
* relational
* database
* garbage
* collection
* retrieval 
* model
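
The nearest-neighbour search described above can be sketched without any library. The two-dimensional vectors in `toy` are invented for illustration and are not taken from the trained model; `most_similar` here is a hypothetical stand-in for a library call.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar(token, embeddings, topn=3):
    """Rank every other token by cosine similarity to `token`."""
    target = embeddings[token]
    scored = [(other, cosine(target, vec))
              for other, vec in embeddings.items() if other != token]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Invented toy embeddings keyed by stemmed token.
toy = {
    "databas": [1.0, 0.0],
    "dbm": [0.9, 0.1],
    "tabl": [0.7, 0.7],
    "garbag": [0.0, 1.0],
    "reclaim": [-1.0, 0.2],
}
print([token for token, _ in most_similar("databas", toy, topn=3)])
```

Note that the lookup uses stemmed tokens, because the embeddings are trained on stemmed text.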

In [0]:
queries = "relational database garbage collection retrieval model"
# Count the relevant documents (definitions whose entity equals the query) for each query
query1 = "relational database"
query2 = "garbage collection"
query3 = "retrieval model"
relevant1 = 0
relevant2 = 0
relevant3 = 0

for i in range (len(stemDocumentRepo)):
    entity, _ = documentRepo[str(i)]
    if entity == query1:
        relevant1 += 1
    elif entity == query2:
        relevant2 += 1
    elif entity == query3:
        relevant3 += 1
print ("The number of " + str(query1) + ": " + str(relevant1))
print ("The number of " + str(query2) + ": " + str(relevant2))
print ("The number of " + str(query3) + ": " + str(relevant3))

The number of relational database: 284
The number of garbage collection: 38
The number of retrieval model: 2


In [0]:
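import numpy as np
import sys
import math  # needed by the "weight" branch (math.log10)

def vector_pooling(querys, definition, model, method):
    # Part 3 variant of vector_pooling: query tokens must already be stemmed,
    # because this version looks them up directly without calling porter.stem.
    query_vec = []
    doc_vec = []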
    if method == "max":
        query_vec = [-sys.maxsize] * 20
        doc_vec = [-sys.maxsize] * 20
        for query in querys.split():
            query_vec = np.maximum(np.array(query_vec), model.wv[query])
        for word in definition.split():
            try:
                doc_vec   = np.maximum(doc_vec, model.wv[word])
            except KeyError:
                continue

    elif method == "min":
        query_vec = [sys.maxsize] * 20
        doc_vec = [sys.maxsize] * 20
        for query in querys.split():
            query_vec = np.minimum(np.array(query_vec), model.wv[query])
        for word in definition.split():
            try:
                doc_vec   = np.minimum(doc_vec, model.wv[word])
            except KeyError:
                continue

    elif method == "mean":
        query_list = [0] * 20
        count_q = len(querys.split())
        for query in querys.split():
            query_list = np.add(query_list, model.wv[query])
        query_vec = [x/count_q for x in query_list]

        doc_list = [0] * 20
        count_d = len(definition.split())
        for word in definition.split():
            try:
                doc_list = np.add(doc_list, model.wv[word])
            except KeyError:
                continue
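        # divide by the document length, not the query length (guarding empty documents)
        doc_vec = [x/count_d for x in doc_list] if count_d else doc_list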

    elif method == "sum":
        query_vec = [0] * 20
        for query in querys.split():
            query_vec = np.add(query_vec, model.wv[query])

        doc_vec = [0] * 20
        for word in definition.split():
            try:
                doc_vec = np.add(doc_vec, model.wv[word])
            except KeyError:
                continue

    elif method == "weight":
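        query_list = [0] * 20
        term1 = [] # tokens weighted 1.0 (intended: the original query)
        term2 = [] # tokens weighted 1/3 (intended: the expansion tokens)
        idx = 0
        count_q = 4 # original 2, expanded 6 * 1/3 = 2
        # NOTE: the first two tokens of `querys` are treated as the original query.
        # query_expansion() emits each original token followed by its three
        # neighbours, so in an expanded query the second original token sits at
        # position 5 and is weighted 1/3 here.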
        for query in querys.split():
            idx = idx + 1
            if idx <= 2:
                term1.append(query)
                query_list = np.add(query_list, model.wv[query])
            else:
                term2.append(query)
                query_list = np.add(query_list, model.wv[query]/3)
        query_vec = [x/count_q for x in query_list]

        count_d = len(definition.split())
        tf = 1
        doc_list = [0] * 20
        for word in definition.split():
            try:
                if word in term1:
                    if definition.count(word) > 0:
                        tf = 1 + math.log10(definition.count(word))
                    doc_list = np.add(doc_list, tf * model.wv[word])
                elif word in term2:
                    if definition.count(word) > 0:
                        tf = 1 + math.log10(definition.count(word))
                    doc_list = np.add(doc_list, tf/3 * model.wv[word])
                else:
                    doc_list = np.add(doc_list, 0.05 * model.wv[word])
            except KeyError:
                continue
        if count_d == 0:
            doc_vec = [0] * 20
        else:
            doc_vec = [x/count_d for x in doc_list]

    return query_vec, doc_vec

def vector_space_model_with_embeddings (querys, definition, model, method):
    # return the cosine similarity of query_vec and doc_vec
    query_vec, doc_vec = vector_pooling(querys, definition, model, method)
    dot_product = sum(i[0] * i[1] for i in zip(query_vec, doc_vec))
    if np.linalg.norm(doc_vec) == 0:
        cosine = -1
    else:
        cosine = dot_product / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec))
    return cosine

def rank_querys_docs (stemDocumentRepo, querys, model):
    maxScores = {}
    minScores = {}
    meanScores = {}
    sumScores = {}
    weightScores = {}
    for i in range (len(stemDocumentRepo)):
        _, stemdefinition = stemDocumentRepo[str(i)]
        entity, _ = documentRepo[str(i)]
        # Max method
        maxScores[i]  = vector_space_model_with_embeddings(querys, stemdefinition, model, "max")
        # Min method
        minScores[i]  = vector_space_model_with_embeddings(querys, stemdefinition, model, "min")
        # Mean method
        meanScores[i] = vector_space_model_with_embeddings(querys, stemdefinition, model, "mean")
        # Sum method
        sumScores[i]  = vector_space_model_with_embeddings(querys, stemdefinition, model, "sum")
        # TF weighting method
        weightScores[i] = vector_space_model_with_embeddings(querys, stemdefinition, model, "weight")
    return maxScores, minScores, meanScores, sumScores, weightScores


In [0]:
# your code here
import warnings  
warnings.filterwarnings(action='ignore',category=UserWarning,module='gensim')  

from gensim.models import Word2Vec
from nltk.stem import PorterStemmer
porter = PorterStemmer()

model = Word2Vec.load("word2vec_hw4.model")
tokens = "relational database garbage collection retrieval model"

print(model.wv.most_similar(positive=porter.stem('relational'), topn=3))
print(model.wv.most_similar(positive=porter.stem('database'), topn=3))
print(model.wv.most_similar(positive=porter.stem('garbage'), topn=3))
print(model.wv.most_similar(positive=porter.stem('collection'), topn=3))
print(model.wv.most_similar(positive=porter.stem('retrieval'), topn=3))
print(model.wv.most_similar(positive=porter.stem('model'), topn=3))

[('tabl', 0.8046892881393433), ('entiti', 0.7562270164489746), ('common', 0.7351295351982117)]
[('dbm', 0.8211696147918701), ('collect', 0.7607239484786987), ('compris', 0.7468408942222595)]
[('collector', 0.8240309953689575), ('longer', 0.8046669960021973), ('reclaim', 0.7889659404754639)]
[('repositori', 0.7911487817764282), ('data', 0.7804063558578491), ('gather', 0.7676119208335876)]
[('store', 0.873002827167511), ('data', 0.7816286087036133), ('metadata', 0.7605143785476685)]
[('mathemat', 0.787505030632019), ('breakthrough', 0.7493785619735718), ('illustr', 0.7133469581604004)]




Second, please add these similar tokens to the original query and redo the **vector space model** in part 2.
* query: relational database
* query: garbage collection
* query: retrieval model
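
A minimal sketch of the expansion step is shown below. It hard-codes the neighbours that the `most_similar` output above reports for "relat" and "databas" instead of querying the model; the names `NEIGHBOURS` and `expand_query` are hypothetical.

```python
# Nearest neighbours copied from the most_similar output reported above.
NEIGHBOURS = {
    "relat": ["tabl", "entiti", "common"],
    "databas": ["dbm", "collect", "compris"],
}


def expand_query(stemmed_query, neighbours, topn=3):
    """Follow every query token with up to `topn` of its nearest neighbours."""
    expanded = []
    for token in stemmed_query.split():
        expanded.append(token)
        # Tokens without an entry are kept but not expanded.
        expanded.extend(neighbours.get(token, [])[:topn])
    return " ".join(expanded)


print(expand_query("relat databas", NEIGHBOURS))
```

With this ordering each original token is followed by its own neighbours, so the original tokens are not adjacent in the expanded string.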

In [0]:
# your code here
def calculate_recall_10 (docScores, query, total_relevants):
    printTopN = 10
    count = 0
    #print ("##############")
    for docID, score in sorted(docScores.items(), key=lambda item: item[1], reverse = True):
        if printTopN > 0:
            #print (str(documentRepo[str(docID)][0]) + " "+ str(score))
            if query == documentRepo[str(docID)][0]:
                count = count + 1
            printTopN-=1
        else:
            break
    return count/total_relevants

baseline_queries = [
    ("relational database", "relat databas", relevant1),
    ("garbage collection", "garbag collect", relevant2),
    ("retrieval model", "retriev model", relevant3),
]
# Recall@10 for the original (unexpanded) queries; tokens are passed in stemmed form
for query, query_stem, relevant in baseline_queries:
    maxScores, minScores, meanScores, sumScores, weightScores = rank_querys_docs (stemDocumentRepo, query_stem, model)
    print ("Query: " + str(query))
    print ("Recall@10: ")
    print ("Max pooling: " + str(calculate_recall_10(maxScores, query, relevant)))
    print ("Min pooling: " + str(calculate_recall_10(minScores, query, relevant)))
    print ("Mean pooling: " + str(calculate_recall_10(meanScores, query, relevant)))
    print ("Sum pooling: " + str(calculate_recall_10(sumScores, query, relevant)))
    print ("Weighted sum: " + str(calculate_recall_10(weightScores, query, relevant)))

Query: relational database
Recall@10: 
Max pooling: 0.014084507042253521
Min pooling: 0.02112676056338028
Mean pooling: 0.014084507042253521
Sum pooling: 0.014084507042253521
Weighted sum: 0.017605633802816902
Query: garbage collection
Recall@10: 
Max pooling: 0.0
Min pooling: 0.0
Mean pooling: 0.0
Sum pooling: 0.0
Weighted sum: 0.13157894736842105
Query: retrieval model
Recall@10: 
Max pooling: 0.0
Min pooling: 0.0
Mean pooling: 0.0
Sum pooling: 0.0
Weighted sum: 0.0


Report recall@10 before the query expansion:
<br>**[ANS] As above**
<br>**Query: relational database**
<br>Recall@10: 
<br>Max pooling: 0.014084507042253521
<br>Min pooling: 0.02112676056338028
<br>Mean pooling: 0.014084507042253521
<br>Sum pooling: 0.014084507042253521
<br>Weighted sum: 0.017605633802816902
<br>
<br>**Query: garbage collection**
<br>Recall@10: 
<br>Max pooling: 0.0
<br>Min pooling: 0.0
<br>Mean pooling: 0.0
<br>Sum pooling: 0.0
<br>Weighted sum: 0.13157894736842105
<br>
<br>**Query: retrieval model**
<br>Recall@10: 
<br>Max pooling: 0.0
<br>Min pooling: 0.0
<br>Mean pooling: 0.0
<br>Sum pooling: 0.0
<br>Weighted sum: 0.0

Report recall@10 after the query expansion:
<br>**[ANS] As below**
<br>**Query: relational database**
<br>Query Expansion:  relat tabl entiti common databas dbm collect compris
<br>Recall@10: 
<br>Max pooling: 0.014084507042253521
<br>Min pooling: 0.028169014084507043
<br>Mean pooling: 0.028169014084507043
<br>Sum pooling: 0.028169014084507043
<br>Weighted sum: 0.028169014084507043
<br>
<br>**Query: garbage collection**
<br>Query Expansion:  garbag collector longer reclaim collect repositori data gather
<br>Recall@10: 
<br>Max pooling: 0.0
<br>Min pooling: 0.0
<br>Mean pooling: 0.0
<br>Sum pooling: 0.0
<br>Weighted sum: 0.13157894736842105
<br>
<br>**Query: retrieval model**
<br>Query Expansion:  retriev store data metadata model mathemat breakthrough illustr
<br>Recall@10: 
<br>Max pooling: 0.0
<br>Min pooling: 0.0
<br>Mean pooling: 0.0
<br>Sum pooling: 0.0
<br>Weighted sum: 0.0

In [0]:
print(model.wv.most_similar(positive=porter.stem('relational'), topn=3))
print(model.wv.most_similar(positive=porter.stem('database'), topn=3))
print(model.wv.most_similar(positive=porter.stem('garbage'), topn=3))
print(model.wv.most_similar(positive=porter.stem('collection'), topn=3))
print(model.wv.most_similar(positive=porter.stem('retrieval'), topn=3))
print(model.wv.most_similar(positive=porter.stem('model'), topn=3))

def query_expansion(querys):
    # For each query token, emit its stem followed by its 3 nearest neighbours.
    querys_expand = ''
    for query in querys.split():
        stem = porter.stem(query)
        querys_expand = querys_expand + " " + stem
        # look the neighbours up once per token instead of three times
        for neighbour, _ in model.wv.most_similar(positive=stem, topn=3):
            querys_expand = querys_expand + " " + str(neighbour)
    return querys_expand

[('tabl', 0.8046892881393433), ('entiti', 0.7562270164489746), ('common', 0.7351295351982117)]
[('dbm', 0.8211696147918701), ('collect', 0.7607239484786987), ('compris', 0.7468408942222595)]
[('collector', 0.8240309953689575), ('longer', 0.8046669960021973), ('reclaim', 0.7889659404754639)]
[('repositori', 0.7911487817764282), ('data', 0.7804063558578491), ('gather', 0.7676119208335876)]
[('store', 0.873002827167511), ('data', 0.7816286087036133), ('metadata', 0.7605143785476685)]
[('mathemat', 0.787505030632019), ('breakthrough', 0.7493785619735718), ('illustr', 0.7133469581604004)]




In [0]:
query1 = "relational database"
query1_expand = query_expansion(query1)
query2 = "garbage collection"
query2_expand = query_expansion(query2)
query3 = "retrieval model"
query3_expand = query_expansion(query3)

maxScores, minScores, meanScores, sumScores, weightScores = rank_querys_docs (stemDocumentRepo, query1_expand, model)
print ("Query: " + str(query1))
print ("Query Expansion: " + str(query1_expand))
print ("Recall@10: ")
maxprecision = calculate_recall_10(maxScores, query1, relevant1)
minprecision = calculate_recall_10(minScores, query1, relevant1)
meanprecision = calculate_recall_10(meanScores, query1, relevant1)
sumprecision = calculate_recall_10(sumScores, query1, relevant1)
weightprecision = calculate_recall_10(weightScores, query1, relevant1)
print ("Max pooling: " + str(maxprecision))
print ("Min pooling: " + str(minprecision))
print ("Mean pooling: " + str(meanprecision))
print ("Sum pooling: " + str(sumprecision))
print ("Weighted sum: " + str(weightprecision))

maxScores, minScores, meanScores, sumScores, weightScores = rank_querys_docs (stemDocumentRepo, query2_expand, model)
print ("Query: " + str(query2))
print ("Query Expansion: " + str(query2_expand))
print ("Recall@10: ")
maxprecision = calculate_recall_10(maxScores, query2, relevant2)
minprecision = calculate_recall_10(minScores, query2, relevant2)
meanprecision = calculate_recall_10(meanScores, query2, relevant2)
sumprecision = calculate_recall_10(sumScores, query2, relevant2)
weightprecision = calculate_recall_10(weightScores, query2, relevant2)
print ("Max pooling: " + str(maxprecision))
print ("Min pooling: " + str(minprecision))
print ("Mean pooling: " + str(meanprecision))
print ("Sum pooling: " + str(sumprecision))
print ("Weighted sum: " + str(weightprecision))

maxScores, minScores, meanScores, sumScores, weightScores = rank_querys_docs (stemDocumentRepo, query3_expand, model)
print ("Query: " + str(query3))
print ("Query Expansion: " + str(query3_expand))
print ("Recall@10: ")
maxprecision = calculate_recall_10(maxScores, query3, relevant3)
minprecision = calculate_recall_10(minScores, query3, relevant3)
meanprecision = calculate_recall_10(meanScores, query3, relevant3)
sumprecision = calculate_recall_10(sumScores, query3, relevant3)
weightprecision = calculate_recall_10(weightScores, query3, relevant3)
print ("Max pooling: " + str(maxprecision))
print ("Min pooling: " + str(minprecision))
print ("Mean pooling: " + str(meanprecision))
print ("Sum pooling: " + str(sumprecision))
print ("Weighted sum: " + str(weightprecision))




Query: relational database
Query Expansion:  relat tabl entiti common databas dbm collect compris
Recall@10: 
Max pooling: 0.014084507042253521
Min pooling: 0.028169014084507043
Mean pooling: 0.028169014084507043
Sum pooling: 0.028169014084507043
Weighted sum: 0.028169014084507043
Query: garbage collection
Query Expansion:  garbag collector longer reclaim collect repositori data gather
Recall@10: 
Max pooling: 0.0
Min pooling: 0.0
Mean pooling: 0.0
Sum pooling: 0.0
Weighted sum: 0.13157894736842105
Query: retrieval model
Query Expansion:  retriev store data metadata model mathemat breakthrough illustr
Recall@10: 
Max pooling: 0.0
Min pooling: 0.0
Mean pooling: 0.0
Sum pooling: 0.0
Weighted sum: 0.0


### Discussions
Why do we measure recall here instead of precision or NDCG?
<br>**[ANS]**
**Query expansion may significantly decrease precision and NDCG, particularly when the added terms are ambiguous. Its purpose is to retrieve as many of the relevant documents as possible, so recall is the metric used here.**
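
When reading the recall@10 figures, it helps to remember that only ten documents are inspected, so recall@10 cannot exceed 10 divided by the number of relevant documents. The sketch below works this out from the relevant-document counts reported earlier in this notebook; the helper name `recall_at_k` is hypothetical.

```python
def recall_at_k(hits_in_top_k, total_relevant):
    """Share of all relevant documents that appear in the top k."""
    return hits_in_top_k / total_relevant


# Relevant-document counts reported earlier in this notebook.
total_relevant = {
    "relational database": 284,
    "garbage collection": 38,
    "retrieval model": 2,
}

for query, total in total_relevant.items():
    # At most min(10, total) relevant documents can fit in the top 10.
    ceiling = recall_at_k(min(10, total), total)
    print(query, round(ceiling, 4))
```

For example, 8 relevant documents in the top 10 out of 284 gives 8/284 ≈ 0.0282, which is the value reported for "relational database" after expansion.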

Should the tokens added for expansion have the same importance as the original query tokens? If not, how could the query expansion in this part be improved?
<br>**[ANS]**
<br>**No. I assign lower weights to the expansion tokens: if an original query token has weight 1.0, an expansion token has weight 0.33333.**
<br>**One possible improvement is to drop the single combined query, retrieve the top 10 results for the original query and for each group of expansion tokens separately, and then intersect the result lists.**
<br>
<br>For instance, take the following example.
<br>Query: relational database
<br>**Query Expansion: relat tabl entiti common databas dbm collect compris**
<br>We can split this into the pairs [relat databas], [tabl dbm], [entiti collect], and [common compris], find the top 10 for each pair, and intersect the results.
<br>If needed, we can also give a higher weight to the original query, [relat databas].
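
The intersection idea can be sketched as follows. This is only one possible reading of the suggestion, with invented scores; `top_k` and `intersect_rankings` are hypothetical helpers, and a real run would use one score table per sub-query from the ranking code above.

```python
def top_k(scores, k=10):
    """Return the ids of the k highest-scoring documents."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def intersect_rankings(score_tables, k=10):
    """Keep documents found in the top k of every sub-query, ordered by the first one."""
    tops = [top_k(scores, k) for scores in score_tables]
    common = set(tops[0]).intersection(*tops[1:])
    return [doc_id for doc_id in tops[0] if doc_id in common]


# Invented scores for the original query and for one expansion pair.
original = {"d1": 0.9, "d2": 0.8, "d3": 0.7, "d4": 0.1}
expanded = {"d1": 0.6, "d3": 0.9, "d4": 0.8, "d2": 0.2}
print(intersect_rankings([original, expanded], k=3))
```

Keeping the order of the first table is one way to give the original query priority; a strict intersection can also return fewer than ten documents, which limits recall.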