#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2020


# Homework 4:  Word Embeddings for Information Retrieval and Query Expansion

### 100 points [5% of your final grade]

### Due: April 28, 2020 by 11:59pm

*Goals of this homework:* In this homework you will improve the information retrieval engine from homework 1 using word embeddings in two ways: (i) directly matching the query and the document in the latent semantic space of word embeddings; (ii) expanding the original query via word embeddings.

*Submission instructions (eCampus):* To submit your homework, rename this notebook as `UIN_hw4.ipynb`. For example, my homework submission would be something like `555001234_hw4.ipynb`. Submit this notebook via eCampus (look for the homework 1 assignment there). Your notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so). So please run all the cells for us, and then submit.

*Late submission policy:* For this homework, you may use as many late days as you like (up to the total allotted to you).

*Collaboration policy:* You are expected to complete each homework independently. Your solution should be written by you without the direct aid or help of anyone else. However, we believe that collaboration and team work are important for facilitating learning, so we encourage you to discuss problems and general problem approaches (but not actual solutions) with your classmates. You may post on Piazza, search StackOverflow, etc. But if you do get help in this way, you must inform us by **filling out the Collaboration Declarations at the bottom of this notebook**. 

*Example: I found helpful code on stackoverflow at https://stackoverflow.com/questions/11764539/writing-fizzbuzz that helped me solve Problem 2.*

The basic rule is that no student should explicitly share a solution with another student (and thereby circumvent the basic learning process), but it is okay to share general approaches, directions, and so on. If you feel like you have an issue that needs clarification, feel free to contact either me or the TA.

## Part 0. Dataset and Parsing (The same as Homework 1)

The dataset is collected from Quizlet (https://quizlet.com), a website where users can generate their own flashcards. Each user-generated flashcard has an entity on the front and a definition describing or explaining that entity on the back. We treat the entities on the front of the flashcards as the queries and the definitions on the back as the documents. A definition (document) is relevant to an entity (query) if it comes from the back of that entity's flashcard; otherwise it is not relevant. **In this homework, the terms query and entity are used interchangeably, as are document and definition.**

The format of the dataset is like this:

**query \t document id \t document**

Examples:

decision tree	\t 27946 \t	show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers).

where "decision tree" is the entity in the front of a flashcard and "show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers)." is the definition on the flashcard's back and "27946" is the id of the definition. Naturally, this document is relevant to the query.

false positive rate	\t 686	\t fall-out; probability of a false alarm

where document 686 is not relevant to query "decision tree" because the entity of "fall-out; probability of a false alarm" is "false positive rate".
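
As a minimal sketch of the line format described above, the snippet below splits one tab-separated line into its three fields. The helper name `parse_line` and the shortened sample line are illustrative assumptions, not part of the provided notebook code.

```python
# Minimal sketch: split one dataset line into (query, doc_id, document).
# `parse_line` and the shortened sample text are illustrative assumptions.
def parse_line(line):
    """Split a 'query <TAB> document id <TAB> document' line into typed fields."""
    query, doc_id, document = line.rstrip("\n").split("\t", 2)
    # int() tolerates surrounding spaces, so " 27946 " parses as well
    return query.strip(), int(doc_id), document.strip()

sample = "decision tree\t27946\tshow complex processes with multiple decision rules."
query, doc_id, document = parse_line(sample)
print(query, doc_id)
```

Splitting with a maximum of two splits keeps any stray tab characters inside the definition text from producing extra fields.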

For parsing this dataset, you could also just copy your code from homework 1 to complete the following tasks:
* Tokenize documents (definitions) using **whitespace and punctuation as delimiters**.
* Remove stop words: use the nltk stop words list (from nltk.corpus import stopwords)
* Stemming: use the [nltk Porter stemmer](http://www.nltk.org/api/nltk.stem.html#module-nltk.stem.porter)
* Remove any other strings that you think are less informative or noisy.
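
The first two steps can be sketched with the standard library alone. This is only an illustration under stated assumptions: `TOY_STOPWORDS` is a tiny stand-in for the nltk stop word list, the function name `tokenize` is hypothetical, and stemming is left out because it needs nltk.

```python
import re

# Stand-in stop-word list (assumption); the homework uses nltk's English list.
TOY_STOPWORDS = {"a", "as", "of", "with", "the"}

def tokenize(text, stopwords=TOY_STOPWORDS):
    """Lowercase, split on whitespace/punctuation, and drop stop words."""
    # Any run of characters other than letters or digits acts as a delimiter
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    # re.split can yield empty strings at the edges, so filter them out
    return [t for t in tokens if t and t not in stopwords]

print(tokenize("fall-out; probability of a false alarm"))
```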

In [1]:
# Import necessary packages

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import time
import math

import numpy as np
from numpy import dot
from numpy.linalg import norm

from collections import Counter

In [2]:
# configuration options
remove_stopwords = True
use_stemming = True
remove_otherNoise = True

In [3]:
# Parser function: normalizes a single token according to the configuration options above.
# It handles the initial preprocessing (hyphen stripping, case folding, stemming).
# Stop words are removed in the dataset-building cell below, before this function is called.

# Create the stemmer once instead of on every call
myStemmer = PorterStemmer()

def myParser(word):
    
    # strip leading and trailing hyphens
    word = word.strip("-")
    
    # case folding
    word = word.lower()            
    
    if use_stemming:
        # stem the word using Porter's algorithm
        word = myStemmer.stem(word)
                
    return word

In [4]:
# Build Dataset
# contains the documents indexed by their ids
# each value is a list with 2 fields: [0] -> definition, [1] -> entity
documents = {}

# document tokens indexed by their ids to avoid redundant processing
doc_tokens = {}

Total_document_length = 0

# inverted index for our corpus
inverted_index = {}

# word frequency for ranking the documents
# 0 -> term frequency in the corpus
word_frequency = {}

# prepare a vocabulary index for the column id of individual words
vocab_id_table = {}
vocab_id = 0

# get a document id list for our queries
query_ground_truth = {"relational database":[], "garbage collection":[],"retrieval model":[]}

stopwords_set = set(stopwords.words('english'))
stopword_count  = 0
with open('homework_1_data.txt', 'r', encoding="utf8") as f:
    lines = f.readlines()
    for line in lines:
        doc = line.split('\t')
        curr_doc_id = int(doc[1])

        if doc[0] in query_ground_truth:
            query_ground_truth[doc[0]].append(curr_doc_id)

        # ids are converted to integers for faster matching
        documents[curr_doc_id] = [doc[2],doc[0]]

        # replace every non-alphabetic character with a space
        doc[2] = ''.join(c if c.isalpha() else ' ' for c in doc[2])

        if remove_otherNoise == True:        
            # Remove irrelevant punctuations
            remove_punct = '!"#$&\'()*+,./;:<=>?@[\\]^`{|}~%_-'
            table = str.maketrans(remove_punct, ' '*len(remove_punct))
            doc[2] = doc[2].translate(table)

        words_stop = doc[2].split()
        words = []
        
        for word in words_stop:
            word = word.lower()
            
            if remove_stopwords and word in stopwords_set:
                continue
            
            if len(word) < 3:
                continue
            
            words.append(word)
                
        Total_document_length += len(words)

        # normalize each token and update the inverted index
        for i in range(len(words)):

            words[i] = myParser(words[i])

            # is this word in our dictionary already?
            if words[i] in inverted_index:

                # add the current document id to that word's document list
                inverted_index[words[i]].add(curr_doc_id)

            # seeing the word for the first time
            else:
                # add the current document id to that word's document list
                inverted_index[words[i]] = set({curr_doc_id})

                # returns the id for individual words
                vocab_id_table[words[i]] = vocab_id
                vocab_id += 1

        # store the tokens for this document
        doc_tokens[curr_doc_id] = words

# total number of documents in the corpus
N = len(doc_tokens)

print("Size of the dictionary: ",len(inverted_index))

Size of the dictionary:  9122


In [5]:
#
## prepare vector space for documents
#

doc_vectors = np.zeros((N,vocab_id))
IDF = {}
IDF_BM25 = {}

for word,word_id in vocab_id_table.items():
    nq = len(inverted_index[word])
    IDF[word] = math.log10(N/nq)
    IDF_BM25[word] = math.log10((N - nq + 0.5)/(nq + 0.5))

for this_doc_id,this_doc_tokens in doc_tokens.items():

    # count the number of occurrences of each word
    term_freq = Counter(this_doc_tokens)
    # for each word and its count
    for term,count in term_freq.items():
        tf = 1 + math.log10(count)
        # documents are rows and words are columns
        # (indexing rows by document id assumes the ids run from 0 to N-1)
        doc_vectors[this_doc_id][vocab_id_table[term]] += (tf * IDF[term])
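
The weighting used in the cell above (log-scaled term frequency times `log10(N/nq)`) can be checked on a small example. This is a sketch only: the helper name `tfidf_weight` and the counts passed to it are made-up illustrations, not values from the dataset.

```python
import math

def tfidf_weight(count, n_docs, doc_freq):
    """Log-scaled TF times log10 IDF, mirroring the formulas in the cell above."""
    tf = 1 + math.log10(count)           # 1 + log10(term count in the document)
    idf = math.log10(n_docs / doc_freq)  # log10(N / number of docs containing the term)
    return tf * idf

# Hypothetical numbers: a term occurring 10 times in a document,
# and in 10 of 1000 documents overall.
print(tfidf_weight(10, 1000, 10))
```

A term that appears in every document gets an IDF of zero, so it contributes nothing to the document vector regardless of how often it occurs.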

# Part 1: Word2Vec (30 points)

In this part you will use the Word2Vec algorithm to generate word embeddings for tokens in the dataset. You can just use a package like https://radimrehurek.com/gensim/models/word2vec.html. Let's set the size of word embeddings to be 20. Please print the word embeddings for the tokens: 
* relational
* database
* garbage
* collection
* retrieval 
* model

In [6]:
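# Train Word2Vec on the parsed document tokens.
import gensim.models
embedding_size = 20
# Note: `size` is the gensim 3.x parameter name; gensim 4.x renamed it to `vector_size`.
model = gensim.models.Word2Vec(sentences=list(doc_tokens.values()), size=embedding_size, window=5)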

In [7]:
# print the word embeddings of the six tokens
relational = model.wv[myParser("relational")]
database = model.wv[myParser("database")]
garbage = model.wv[myParser("garbage")]
collection = model.wv[myParser("collection")]
retrieval = model.wv[myParser("retrieval")]
model_vec = model.wv[myParser("model")]

print("-------------------------------------------------------------------")
print("Relational\n" , relational)
print("-------------------------------------------------------------------")
print("Database\n", database)
print("-------------------------------------------------------------------")
print("Garbage\n",garbage)
print("-------------------------------------------------------------------")
print("Collection\n", collection)
print("-------------------------------------------------------------------")
print("Retrieval\n", retrieval)
print("-------------------------------------------------------------------")
print("Model\n",model_vec)
print("-------------------------------------------------------------------")


-------------------------------------------------------------------
Relational
 [ 2.2680657   0.06140756 -0.44610825 -1.1656089  -0.84581745 -0.88587624
  1.3794376  -2.1949944   0.44469178  0.4344101   2.9320416  -0.91699606
  2.0353422   0.4220861   0.9019404  -0.2977019   2.06147     2.64691
 -1.083384    0.5937177 ]
-------------------------------------------------------------------
Database
 [ 1.5953257  -0.5497807   0.39062363 -2.8846865  -1.8896031   1.4535891
  0.47864187 -2.3056474   3.1766396  -0.26811945  2.465781   -1.5395129
  1.417544    0.3847839  -0.8906662  -0.20475644  2.1688192   2.7075443
 -1.8384706   0.0804768 ]
-------------------------------------------------------------------
Garbage
 [ 0.23651579 -0.27403742 -0.15699883  0.16288637  0.01649721 -0.00773037
 -0.00263683 -0.17949635  0.36617285 -0.10970375 -0.01249154 -0.08075885
  0.19968687  0.00279451  0.12102219  0.04184372  0.4077469   0.04915323
 -0.36516207 -0.17871958]
------------------------------------

# Part 2: Vector Space Model via Word Embeddings (40 points) 

In this part, your job is to match the query and the document via the cosine similarity between their embeddings.
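
For reference, cosine similarity divides the dot product of two vectors by the product of their norms. The sketch below computes it for every document/query pair at once; the function name `cosine_similarity_matrix`, the `eps` guard, and the toy vectors are illustrative assumptions.

```python
import numpy as np

def cosine_similarity_matrix(docs, queries, eps=1e-12):
    """Return an (n_docs, n_queries) matrix of cosine similarities."""
    docs = np.asarray(docs, dtype=float)
    queries = np.asarray(queries, dtype=float)
    doc_norms = np.linalg.norm(docs, axis=1, keepdims=True)
    query_norms = np.linalg.norm(queries, axis=1, keepdims=True)
    # eps guards against division by zero for all-zero (e.g. empty) vectors
    docs_unit = docs / np.maximum(doc_norms, eps)
    queries_unit = queries / np.maximum(query_norms, eps)
    return docs_unit @ queries_unit.T

# Toy 2-dimensional vectors (assumption), one query and three documents
docs = [[1.0, 0.0], [10.0, 10.0], [0.0, 0.0]]
queries = [[1.0, 1.0]]
sims = cosine_similarity_matrix(docs, queries)
print(sims)
```

Normalizing first means a long document does not score higher simply because its aggregated vector has a larger magnitude.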

Since a query or a document usually contains more than one token, the first challenge is how to aggregate many word embeddings into a single embedding for the query or document. There are many ways to do so: 
* Max pooling: return the maximum value along each dimension of a bunch of word embeddings. For example, [1, 3, 4], [2, 1, 5] -> [2, 3, 5].
* Min pooling: return the minimum value along each dimension of a bunch of word embeddings
* Mean pooling: return the mean value along each dimension of a bunch of word embeddings
* Sum: element-wise add a bunch of word embeddings together
* Weighted sum: assign weights to word embeddings and then add them together. Weights could be TF, IDF or TF-IDF.
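
The aggregation methods listed above can be sketched in a few lines of NumPy. The function name `aggregate` and the example weights are assumptions for illustration; the two input vectors are the ones from the max pooling example.

```python
import numpy as np

def aggregate(embeddings, method="mean", weights=None):
    """Collapse a (n_tokens, dim) array of word embeddings into one vector."""
    embeddings = np.asarray(embeddings, dtype=float)
    if method == "max":
        return embeddings.max(axis=0)
    if method == "min":
        return embeddings.min(axis=0)
    if method == "mean":
        return embeddings.mean(axis=0)
    if method == "sum":
        return embeddings.sum(axis=0)
    if method == "weighted_sum":
        if weights is None:
            raise ValueError("weighted_sum needs one weight per token")
        # Scale each token's embedding by its weight (e.g. TF, IDF, or TF-IDF)
        w = np.asarray(weights, dtype=float)
        return (embeddings * w[:, None]).sum(axis=0)
    raise ValueError(f"unknown aggregation method: {method}")

emb = [[1, 3, 4], [2, 1, 5]]
for name in ("max", "min", "mean", "sum"):
    print(name, aggregate(emb, name))
print("weighted_sum", aggregate(emb, "weighted_sum", weights=[2, 1]))
```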

In [8]:
# your code here

max_pool_docvec = np.zeros((len(doc_tokens),embedding_size))
min_pool_docvec = np.zeros((len(doc_tokens),embedding_size))
mean_pool_docvec = np.zeros((len(doc_tokens),embedding_size))
sum_docvec = np.zeros((len(doc_tokens),embedding_size))
weighted_pool_docvec = np.zeros((len(doc_tokens),embedding_size))

for doc_id,tokens in doc_tokens.items():
    if len(tokens)<1:
        continue
    doc_embeddings = np.zeros((len(tokens),embedding_size))

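    for pos,token in enumerate(tokens):
        # the tokens are already parsed; tokens missing from the Word2Vec
        # vocabulary (e.g. below min_count) keep an all-zero row
        if token in model.wv:
            doc_embeddings[pos,:] = model.wv[token]
        
        # weighted sum: scale each embedding by the token's TF-IDF weight in this document
        weighted_pool_docvec[doc_id,:] += doc_embeddings[pos,:] * doc_vectors[doc_id,vocab_id_table[token]]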

    max_pool_docvec[doc_id,:] = np.max(doc_embeddings,axis=0)
    min_pool_docvec[doc_id,:] = np.min(doc_embeddings,axis=0)
    mean_pool_docvec[doc_id,:] = np.mean(doc_embeddings,axis=0)
    sum_docvec[doc_id,:] = np.sum(doc_embeddings,axis=0)

Try different aggregation methods and report the precision@10 for these queries:
* query: relational database
* query: garbage collection
* query: retrieval model
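
Precision@10 is the fraction of the top 10 retrieved documents that are relevant. A minimal sketch follows; the function name `precision_at_k` and the toy document ids are assumptions, not data from the homework.

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Divide by k (not by the number of hits or of relevant documents)
    return hits / k

# Toy ranking (best first) and ground truth; the ids are made up
ranked = [7, 3, 9, 1, 4, 8, 2, 6, 5, 0]
relevant = {3, 4, 5, 42}
print(precision_at_k(ranked, relevant, k=10))  # 3 of the top 10 are relevant
```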

In [9]:
queries = ['relational database','garbage collection','retrieval model']

max_pool_qvec = np.zeros((len(queries),embedding_size))
min_pool_qvec = np.zeros((len(queries),embedding_size))
mean_pool_qvec = np.zeros((len(queries),embedding_size))
sum_qvec = np.zeros((len(queries),embedding_size))
weighted_pool_qvec = np.zeros((len(queries),embedding_size))

for q_id,query in enumerate(queries):
    qtokens = query.split(' ')
    q_embeddings = np.zeros((len(qtokens),embedding_size))

    for qpos,qtoken in enumerate(qtokens):
        # parse (stem) each query token so it matches the document vocabulary
        q_embeddings[qpos,:] = model.wv[myParser(qtoken)]
        weighted_pool_qvec[q_id,:] += model.wv[myParser(qtoken)] * IDF[myParser(qtoken)]
    
    max_pool_qvec[q_id,:] = np.max(q_embeddings,axis=0)
    min_pool_qvec[q_id,:] = np.min(q_embeddings,axis=0)
    mean_pool_qvec[q_id,:] = np.mean(q_embeddings,axis=0)
    sum_qvec[q_id,:] = np.sum(q_embeddings,axis=0)
    

In [10]:
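# Score every document against every query.
# Note: dot() gives unnormalized inner products; dividing by the vector norms
# would turn these scores into true cosine similarities.
cos_sim_maxpool = dot(max_pool_docvec,max_pool_qvec.T)
cos_sim_minpool = dot(min_pool_docvec,min_pool_qvec.T)
cos_sim_meanpool = dot(mean_pool_docvec,mean_pool_qvec.T)
cos_sim_sum = dot(sum_docvec,sum_qvec.T)
cos_sim_tfidf = dot(weighted_pool_docvec,weighted_pool_qvec.T)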

print("Precision@10")
print()
for q_id,query in enumerate(queries):
    # gets the top 10 in linear time
    top10_maxpool = set(np.argpartition(cos_sim_maxpool[:,q_id].T, -10)[-10:])
    top10_minpool = set(np.argpartition(cos_sim_minpool[:,q_id].T, -10)[-10:])
    top10_meanpool = set(np.argpartition(cos_sim_meanpool[:,q_id].T, -10)[-10:])
    top10_sum = set(np.argpartition(cos_sim_sum[:,q_id].T, -10)[-10:])
    top10_tfidf = set(np.argpartition(cos_sim_tfidf[:,q_id].T, -10)[-10:])
        
    gt = set(query_ground_truth[queries[q_id]])
    
    z = top10_maxpool.intersection(gt)
    print(query, " Max pooling: ",len(z)/10)

    z = top10_minpool.intersection(gt)
    print(query, " Min pooling: ",len(z)/10)

    z = top10_meanpool.intersection(gt)
    print(query, " Mean pool: ",len(z)/10)
    
    z = top10_sum.intersection(gt)
    print(query, " sum: ",len(z)/10)

    z = top10_tfidf.intersection(gt)
    print(query, " weighted sum: ",len(z)/10)

    print()
    

Precision@10

relational database  Max pooling:  0.3
relational database  Min pooling:  0.5
relational database  Mean pool:  0.5
relational database  sum:  0.0
relational database  weighted sum:  0.0

garbage collection  Max pooling:  0.0
garbage collection  Min pooling:  0.0
garbage collection  Mean pool:  0.0
garbage collection  sum:  0.0
garbage collection  weighted sum:  0.0

retrieval model  Max pooling:  0.0
retrieval model  Min pooling:  0.0
retrieval model  Mean pool:  0.0
retrieval model  sum:  0.0
retrieval model  weighted sum:  0.0



### Discussion
Among these aggregation methods, which one is the best and which one is the worst?

Because the dataset is neither comprehensive nor large enough to learn word vectors effectively, the reliability of these results is questionable.

Nevertheless, among these methods, min pooling and mean pooling perform best (precision@10 of 0.5 on "relational database"), followed by max pooling (0.3).

The sum and weighted-sum methods perform worst in our vector space, scoring 0.0 on every query. All methods score 0.0 on "garbage collection" and "retrieval model".

# Part 3: Query Expansion via Word Embeddings (30 points) 
Remember the hardest query "retrieval model" in homework 1? Because there is no document containing "retrieval model" in the dataset, you cannot retrieve any documents by Boolean matching. Now it is time for your "revenge" via query expansion.

In this part, your job is to expand the original query like "retrieval model" by adding semantically similar words (e.g., "search"), which are selected from all tokens in the dataset.

There are many ways to do so. For this part, we want you to calculate the cosine similarity between each of the original query tokens and the other tokens based on their word embeddings.

First, please find the top 3 similar tokens for:
* relational
* database
* garbage
* collection
* retrieval 
* model
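
Finding the most similar tokens amounts to ranking every other token by cosine similarity to the query token. The sketch below does this with plain NumPy on a made-up vocabulary; `most_similar_tokens`, `toy_vocab`, and `toy_vectors` are illustrative assumptions rather than the trained embeddings.

```python
import numpy as np

def most_similar_tokens(token, vocab, vectors, topn=3):
    """Return the topn (token, cosine similarity) pairs closest to `token`."""
    vectors = np.asarray(vectors, dtype=float)
    # Normalize rows so that a dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    idx = vocab.index(token)
    sims = unit @ unit[idx]
    sims[idx] = -np.inf  # exclude the query token itself
    best = np.argsort(-sims)[:topn]
    return [(vocab[i], float(sims[i])) for i in best]

# Toy 2-dimensional embeddings (assumption), not learned from the dataset
toy_vocab = ["retrieval", "search", "fetch", "banana", "index"]
toy_vectors = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.3],
    [0.0, 1.0],
    [0.7, 0.7],
]
print(most_similar_tokens("retrieval", toy_vocab, toy_vectors))
```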

In [11]:
# your code here
expanded_queries = [query.split() for query in queries]

for q_id, query in enumerate(queries):
    for token in query.split(' '):
        expand = model.wv.most_similar(positive=[myParser(token)], topn=3)
        expand = [j[0] for j in expand]
        print(token,"-> ",expand)
        expanded_queries[q_id].extend(expand)

relational ->  ['entiti', 'tabl', 'rdb']
database ->  ['dbm', 'rdbm', 'entir']
garbage ->  ['later', 'yet', 'egg']
collection ->  ['repositori', 'gather', 'warehous']
retrieval ->  ['store', 'updat', 'warehous']
model ->  ['mathemat', 'formal', 'diagram']


Report recall@10 before the query expansion:
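
Recall@10 is the fraction of all relevant documents that appear in the top 10 results. A minimal sketch follows; the function name `recall_at_k` and the toy ids are assumptions for illustration.

```python
def recall_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0  # avoid dividing by zero when nothing is relevant
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Toy ranking (best first) and ground truth; the ids are made up
ranked = [7, 3, 9, 1, 4, 8, 2, 6, 5, 0]
relevant = {3, 4, 5, 42}
print(recall_at_k(ranked, relevant, k=10))  # 3 of the 4 relevant documents are found
```

Because the denominator is the number of relevant documents, recall@10 stays small whenever a query has many more than 10 relevant documents.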

In [12]:
# your code here
max_pool_qvec = np.zeros((len(queries),embedding_size))
min_pool_qvec = np.zeros((len(queries),embedding_size))
mean_pool_qvec = np.zeros((len(queries),embedding_size))
sum_qvec = np.zeros((len(queries),embedding_size))
weighted_pool_qvec = np.zeros((len(queries),embedding_size))

for q_id,query in enumerate(queries):
    qtokens = query.split(' ')
    q_embeddings = np.zeros((len(qtokens),embedding_size))

    for qpos,qtoken in enumerate(qtokens):
        # parse (stem) each query token so it matches the document vocabulary
        q_embeddings[qpos,:] = model.wv[myParser(qtoken)]
        weighted_pool_qvec[q_id,:] += model.wv[myParser(qtoken)] * IDF[myParser(qtoken)]
    
    max_pool_qvec[q_id,:] = np.max(q_embeddings,axis=0)
    min_pool_qvec[q_id,:] = np.min(q_embeddings,axis=0)
    mean_pool_qvec[q_id,:] = np.mean(q_embeddings,axis=0)
    sum_qvec[q_id,:] = np.sum(q_embeddings,axis=0)

cos_sim_maxpool = dot(max_pool_docvec,max_pool_qvec.T)
cos_sim_minpool = dot(min_pool_docvec,min_pool_qvec.T)
cos_sim_meanpool = dot(mean_pool_docvec,mean_pool_qvec.T)
cos_sim_sum = dot(sum_docvec,sum_qvec.T)
cos_sim_tfidf = dot(weighted_pool_docvec,weighted_pool_qvec.T)

print("Recall@10 (Before expanding queries)")
print()
for q_id,query in enumerate(queries):
    # gets the top 10 in linear time
    top10_maxpool = set(np.argpartition(cos_sim_maxpool[:,q_id].T, -10)[-10:])
    top10_minpool = set(np.argpartition(cos_sim_minpool[:,q_id].T, -10)[-10:])
    top10_meanpool = set(np.argpartition(cos_sim_meanpool[:,q_id].T, -10)[-10:])
    top10_sum = set(np.argpartition(cos_sim_sum[:,q_id].T, -10)[-10:])
    top10_tfidf = set(np.argpartition(cos_sim_tfidf[:,q_id].T, -10)[-10:])
        
    gt = set(query_ground_truth[queries[q_id]])
    
    z = top10_maxpool.intersection(gt)
    print(query, " Max pooling: ",len(z)/len(gt))

    z = top10_minpool.intersection(gt)
    print(query, " Min pooling: ",len(z)/len(gt))

    z = top10_meanpool.intersection(gt)
    print(query, " Mean pool: ",len(z)/len(gt))
    
    z = top10_sum.intersection(gt)
    print(query, " sum: ",len(z)/len(gt))

    z = top10_tfidf.intersection(gt)
    print(query, " weighted sum: ",len(z)/len(gt))

    print()
    

Recall@10 (Before expanding queries)

relational database  Max pooling:  0.01056338028169014
relational database  Min pooling:  0.017605633802816902
relational database  Mean pool:  0.017605633802816902
relational database  sum:  0.0
relational database  weighted sum:  0.0

garbage collection  Max pooling:  0.0
garbage collection  Min pooling:  0.0
garbage collection  Mean pool:  0.0
garbage collection  sum:  0.0
garbage collection  weighted sum:  0.0

retrieval model  Max pooling:  0.0
retrieval model  Min pooling:  0.0
retrieval model  Mean pool:  0.0
retrieval model  sum:  0.0
retrieval model  weighted sum:  0.0



Report recall@10 after the query expansion:

In [13]:
# your code here
max_pool_qvec = np.zeros((len(expanded_queries),embedding_size))
min_pool_qvec = np.zeros((len(expanded_queries),embedding_size))
mean_pool_qvec = np.zeros((len(expanded_queries),embedding_size))
sum_qvec = np.zeros((len(expanded_queries),embedding_size))
weighted_pool_qvec = np.zeros((len(expanded_queries),embedding_size))

for q_id,qtokens in enumerate(expanded_queries):
    q_embeddings = np.zeros((len(qtokens),embedding_size))

    for qpos,qtoken in enumerate(qtokens):
        # original query tokens are raw and get stemmed here; expansion tokens are already stemmed
        q_embeddings[qpos,:] = model.wv[myParser(qtoken)]
        weighted_pool_qvec[q_id,:] += model.wv[myParser(qtoken)] * IDF[myParser(qtoken)]
    
    max_pool_qvec[q_id,:] = np.max(q_embeddings,axis=0)
    min_pool_qvec[q_id,:] = np.min(q_embeddings,axis=0)
    mean_pool_qvec[q_id,:] = np.mean(q_embeddings,axis=0)
    sum_qvec[q_id,:] = np.sum(q_embeddings,axis=0)

cos_sim_maxpool = dot(max_pool_docvec,max_pool_qvec.T)
cos_sim_minpool = dot(min_pool_docvec,min_pool_qvec.T)
cos_sim_meanpool = dot(mean_pool_docvec,mean_pool_qvec.T)
cos_sim_sum = dot(sum_docvec,sum_qvec.T)
cos_sim_tfidf = dot(weighted_pool_docvec,weighted_pool_qvec.T)

print("Recall@10 (After expanding queries)")
print()
for q_id,query in enumerate(queries):
    print("Original: ",query.split())
    print("Expanded: ",expanded_queries[q_id])
    print()

    # gets the top 10 in linear time
    top10_maxpool = set(np.argpartition(cos_sim_maxpool[:,q_id].T, -10)[-10:])
    top10_minpool = set(np.argpartition(cos_sim_minpool[:,q_id].T, -10)[-10:])
    top10_meanpool = set(np.argpartition(cos_sim_meanpool[:,q_id].T, -10)[-10:])
    top10_sum = set(np.argpartition(cos_sim_sum[:,q_id].T, -10)[-10:])
    top10_tfidf = set(np.argpartition(cos_sim_tfidf[:,q_id].T, -10)[-10:])
        
    gt = set(query_ground_truth[queries[q_id]])
    
    z = top10_maxpool.intersection(gt)
    print(query, " Max pooling: ",len(z)/len(gt))

    z = top10_minpool.intersection(gt)
    print(query, " Min pooling: ",len(z)/len(gt))

    z = top10_meanpool.intersection(gt)
    print(query, " Mean pool: ",len(z)/len(gt))
    
    z = top10_sum.intersection(gt)
    print(query, " sum: ",len(z)/len(gt))

    z = top10_tfidf.intersection(gt)
    print(query, " weighted sum: ",len(z)/len(gt))

    print()
    

Recall@10 (After expanding queries)

Original:  ['relational', 'database']
Expanded:  ['relational', 'database', 'entiti', 'tabl', 'rdb', 'dbm', 'rdbm', 'entir']

relational database  Max pooling:  0.0
relational database  Min pooling:  0.0
relational database  Mean pool:  0.017605633802816902
relational database  sum:  0.0
relational database  weighted sum:  0.0

Original:  ['garbage', 'collection']
Expanded:  ['garbage', 'collection', 'later', 'yet', 'egg', 'repositori', 'gather', 'warehous']

garbage collection  Max pooling:  0.0
garbage collection  Min pooling:  0.0
garbage collection  Mean pool:  0.0
garbage collection  sum:  0.0
garbage collection  weighted sum:  0.0

Original:  ['retrieval', 'model']
Expanded:  ['retrieval', 'model', 'store', 'updat', 'warehous', 'mathemat', 'formal', 'diagram']

retrieval model  Max pooling:  0.0
retrieval model  Min pooling:  0.0
retrieval model  Mean pool:  0.0
retrieval model  sum:  0.0
retrieval model  weighted sum:  0.0



### Discussion
Why do we measure recall here instead of precision or NDCG?

Ground-truth ranking scores are hard to obtain for this dataset, and without them NDCG cannot be computed accurately. Intuitions drawn from that metric could be misleading, so we opt for recall on this dataset.

Should the tokens added for expansion have the same importance as the original query tokens? If not, how can the query expansion in this part be improved?

No, the added tokens should not have the same importance as the original query tokens, because they are not what the user explicitly asked for. To improve the expansion, we can use a weighted sum in which each newly added vector is multiplied by its cosine similarity to the corresponding query word.
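
One way to sketch this weighting, under stated assumptions: original query tokens get weight 1, each expansion token gets its cosine similarity as weight, and the result is a weighted mean. The function name `weighted_query_vector` and the toy vectors are hypothetical, and this is only one of several reasonable weighting schemes.

```python
import numpy as np

def weighted_query_vector(original_vecs, expansion_vecs, expansion_sims):
    """Weighted mean: original tokens weigh 1, expansion tokens weigh their similarity."""
    original_vecs = np.asarray(original_vecs, dtype=float)
    expansion_vecs = np.asarray(expansion_vecs, dtype=float)
    weights = np.asarray(expansion_sims, dtype=float)
    # Down-weight each expansion embedding by its similarity to the query token
    total = original_vecs.sum(axis=0) + (expansion_vecs * weights[:, None]).sum(axis=0)
    return total / (len(original_vecs) + weights.sum())

# Toy vectors and similarity (assumptions), two original tokens and one expansion token
q_vec = weighted_query_vector(
    original_vecs=[[1.0, 0.0], [0.0, 1.0]],
    expansion_vecs=[[1.0, 1.0]],
    expansion_sims=[0.5],
)
print(q_vec)
```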