**Team Project: Web Search and Information Retrieval - Topic 4 Efficient Vector Retrieval**

# IR_v31_evaluation - 05/26/2019

**Purpose:** Test the VSM retrieval model on the **doc_dump.txt** text collection.

# Evaluation

### Test Retrieval Performance on Full Document Collection - 'doc_dump.txt'

### Note: doc_dump.txt contains the raw data, i.e. no preprocessing has been applied!

In [1]:
# read in raw doc_dump.txt text and split into one entry per line
raw_texts = open('data/raw/doc_dump.txt', encoding="utf-8").read()
doc_list = raw_texts.split("\n")
len(doc_list)

5371

A typical document in doc_dump.txt has four tab-separated fields (ID, URL, title, text):

In [2]:
# Show typical example of document in doc_dump.txt
doc_list[1].split("\t")

['MED-2',
 'http://www.ncbi.nlm.nih.gov/pubmed/22809476',
 'A statistical regression model for the estimation of acrylamide concentrations in French fries for excess lifetime cancer risk assessment. - PubMed - NCBI',
 'Abstract Human exposure to acrylamide (AA) through consumption of French fries and other foods has been recognized as a potential health concern. Here, we used a statistical non-linear regression model, based on the two most influential factors, cooking temperature and time, to estimate AA concentrations in French fries. The R(2) of the predictive model is 0.83, suggesting the developed model was significant and valid. Based on French fry intake survey data conducted in this study and eight frying temperature-time schemes which can produce tasty and visually appealing French fries, the Monte Carlo simulation results showed that if AA concentration is higher than 168 ppb, the estimated cancer risk for adolescents aged 13-18 years in Taichung City would be already higher t

In [3]:
# create document collection D: docID -> "title text"
doc_collection = dict()

for line in doc_list:
    fields = line.split("\t")
    # skip malformed lines that lack the four fields: ID, URL, title, text
    if len(fields) != 4:
        continue
    doc_id, _url, title, text = fields
    doc_collection[doc_id] = title + " " + text

We use the same list of documents that was used for testing in the paper.

In [4]:
# load list of documents used for testing
raw_texts = open('data/raw/test.docs.ids').read()
test_list = raw_texts.split("\n")
len(test_list)

3163

In [5]:
# note that the last element in list is simply an empty string and will be removed
test_list[3162]
test_list.pop()
len(test_list)

3162

In [6]:
# create test document collection that will be used to compute speed, MAP
test_ids = set(test_list)  # set membership checks are O(1) instead of scanning the list
doc_collection_test = {idx: text for idx, text in doc_collection.items()
                       if idx in test_ids}

len(doc_collection_test)

3162

In [7]:
# delete full doc_collection to avoid confusion!
del doc_collection

### *LOAD OWN IMPLEMENTATION OF VECTOR SPACE RETRIEVAL FUNCTIONS*

In [8]:
# load own implementations of VSM
from VSM_functions import *

[nltk_data] Downloading package wordnet to /home/roman/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### *INDEXING*

In [9]:
# compute global idf scores on D
idfs = compute_Idf(doc_collection_test)

Idf computation done in 16.6076s.
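
The source of `compute_Idf` lives in `VSM_functions` and is not shown in this notebook. As a rough sketch of what such a step typically computes, assuming plain whitespace tokenisation and the common weighting idf(t) = log(N / df(t)) (both are assumptions, and `sketch_idf` is a hypothetical name, not the project's function):

```python
import math
from collections import Counter

def sketch_idf(doc_collection):
    """Return {term: log(N / df)} for a {doc_id: text} collection."""
    n_docs = len(doc_collection)
    df = Counter()
    for text in doc_collection.values():
        # count every term at most once per document (document frequency)
        df.update(set(text.lower().split()))
    return {term: math.log(n_docs / count) for term, count in df.items()}

# toy collection, not taken from doc_dump.txt
toy_docs = {"d1": "protein intake", "d2": "protein sources", "d3": "cancer risk"}
toy_idfs = sketch_idf(toy_docs)
```

Terms that occur in many documents (here `protein`) receive a lower weight than terms that occur in only one.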


In [10]:
# Compute tfidf scores
tfidfs = compute_TfIdf(doc_collection_test, idfs)

Tf-idf computation done in 22.7820s.


In [11]:
# Construct inverted index
inverted_index = construct_invertedIndex(doc_collection_test, idfs, tfidfs)

InvertedIndex construction done in 0.2316s.


In [12]:
# construct term document matrix
tdm = create_tdm(doc_collection_test, tfidfs)

TDM construction done in 7.7208s.


In [13]:
# Compute dict that stores vector norm length for each document d in D
doc_lengths = construct_docLengthDict(doc_collection_test, tfidfs)

DocLength index construction done in 0.1292s.


In [14]:
# Compute preclustering with sqrt(n) random leaders
clusters = pre_cluster(doc_collection_test)

Preclustering done in 22.2298s.
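
The idea behind preclustering with sqrt(n) random leaders can be sketched as follows. This is a minimal illustration, not the `pre_cluster` implementation from `VSM_functions`: it assumes documents are given as sparse `{term: weight}` vectors and that followers are attached to the leader with the highest cosine similarity; `sketch_pre_cluster`, `cosine` and the `seed` parameter are hypothetical names.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def sketch_pre_cluster(doc_vectors, seed=0):
    """Pick round(sqrt(n)) random leaders and attach each document to its closest leader."""
    rng = random.Random(seed)
    doc_ids = sorted(doc_vectors)
    n_leaders = max(1, round(math.sqrt(len(doc_ids))))
    leaders = rng.sample(doc_ids, n_leaders)
    clusters = {leader: [] for leader in leaders}
    for doc_id in doc_ids:
        best = max(leaders, key=lambda l: cosine(doc_vectors[doc_id], doc_vectors[l]))
        clusters[best].append(doc_id)
    return clusters

# toy vectors, not taken from the collection
toy_vectors = {
    "d1": {"protein": 1.0, "diet": 0.5},
    "d2": {"protein": 0.8},
    "d3": {"cancer": 1.0},
    "d4": {"cancer": 0.7, "risk": 0.6},
}
toy_clusters = sketch_pre_cluster(toy_vectors)
```

At query time, a scheme like this would compare the query only with the leaders and then score the members of the closest cluster, which is why the ranking is approximate.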


In [15]:
# Construct tiered index
tiered_index = construct_tiered_index(doc_collection_test, inverted_index, t=0.5)

TieredIndex construction done in 0.2153s.
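
The notebook does not show how `construct_tiered_index` interprets `t`. One possible reading, and purely an assumption here, is a weight threshold: postings whose tf-idf weight is at least `t` go into a first tier that is searched first, the rest into a second tier. A minimal sketch under that assumption (`sketch_tiered_index` and the `tier1`/`tier2` keys are hypothetical):

```python
def sketch_tiered_index(inverted_index, t=0.5):
    """Split every posting list into a high tier (weight >= t) and a low tier (weight < t)."""
    tiered = {}
    for term, postings in inverted_index.items():
        high = [(doc_id, w) for doc_id, w in postings if w >= t]
        low = [(doc_id, w) for doc_id, w in postings if w < t]
        tiered[term] = {"tier1": high, "tier2": low}
    return tiered

# toy inverted index: term -> [(doc_id, tf-idf weight), ...]
toy_index = {"protein": [("d1", 0.9), ("d2", 0.3)], "cancer": [("d3", 0.6)]}
toy_tiers = sketch_tiered_index(toy_index, t=0.5)
```

Under this reading, a larger `t` leaves fewer postings in the first tier, so more queries would have to fall back to the second tier to fill the top k.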


Full indexing takes about 1 to 2 minutes; the steps timed above sum to roughly 70 s in this run.

### *QUERYING*

### Load in query collection - Using test.nontopic-titles.queries

In [16]:
# read in raw text and split
raw_texts = open('data/test/test.nontopic-titles.queries', encoding="utf-8").read()
query_list = raw_texts.split("\n")
len(query_list)

145

In [17]:
# create dictionary that holds the queryID + queryText
query_col = dict()

for i in range(len(query_list)):
    if len(query_list[i].split("\t")) == 2:
        key_, value_ = query_list[i].split("\t")
        query_col.update({key_ : value_})
        
len(query_col)

144

A typical non-topic page title resembles a query that an average user would type:

In [18]:
# display an example query
query_col["PLAIN-2590"]

'do vegetarians get enough protein ?'

### Load in gold standard for relevance judgements - Using test.2-1-0.qrel

In [19]:
# read in raw text and split
raw_texts = open('data/test/test.2-1-0.qrel').read()
rel_list = raw_texts.split("\n")
len(rel_list)

12335

In [20]:
# create dictionary that holds queryID + emptyList for all queries in Q
gold_col = dict()

for i in range(len(rel_list)):
    list_ = rel_list[i].split("\t")
    if len(list_)==4:
        key_ = list_[0]
        value_ = list()
        if key_ in query_col.keys():
            gold_col.update({key_ : value_})
    else:
        continue

In [21]:
for i in range(len(rel_list)):
    list_ = rel_list[i].split("\t")
    if len(list_)==4:
        key_ = list_[0]
        docID_ = list_[2]
        score_ = list_[3]
        if key_ in gold_col.keys():
            tuple_ = (docID_, int(score_))
            gold_col[key_].append(tuple_)
    else:
        continue
        
# show length of gold_col
len(gold_col)

144

In [22]:
# show list of relevant documents + relevance judgements for PLAIN-2590
gold_col["PLAIN-2590"]

[('MED-2288', 2),
 ('MED-3137', 2),
 ('MED-2290', 2),
 ('MED-2291', 2),
 ('MED-2292', 2),
 ('MED-2293', 2),
 ('MED-2294', 2),
 ('MED-2295', 2),
 ('MED-2296', 2),
 ('MED-2498', 1),
 ('MED-2517', 1),
 ('MED-2519', 1),
 ('MED-2501', 1),
 ('MED-2502', 1),
 ('MED-2513', 1),
 ('MED-2504', 1),
 ('MED-2505', 1),
 ('MED-2506', 1),
 ('MED-2507', 1),
 ('MED-5239', 1),
 ('MED-2509', 1),
 ('MED-2510', 1),
 ('MED-2511', 1),
 ('MED-2512', 1),
 ('MED-3000', 1),
 ('MED-2765', 1),
 ('MED-2997', 1),
 ('MED-3001', 1),
 ('MED-2999', 1),
 ('MED-4313', 1),
 ('MED-3148', 1),
 ('MED-3149', 1),
 ('MED-3242', 1),
 ('MED-3243', 1),
 ('MED-3244', 1),
 ('MED-3245', 1),
 ('MED-3270', 1),
 ('MED-3271', 1),
 ('MED-3272', 1),
 ('MED-3273', 1),
 ('MED-3274', 1),
 ('MED-3275', 1),
 ('MED-3276', 1),
 ('MED-3277', 1),
 ('MED-3278', 1),
 ('MED-3279', 1),
 ('MED-3280', 1),
 ('MED-3281', 1),
 ('MED-3282', 1),
 ('MED-3283', 1),
 ('MED-3580', 1),
 ('MED-3581', 1),
 ('MED-3582', 1),
 ('MED-3583', 1),
 ('MED-3584', 1),
 ('MED-385

Since we now have graded relevance judgements (2, 1, 0) for all queries in our test collection, we can measure the nDCG score of each retrieval system.
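
For reference, nDCG divides the discounted cumulative gain of a ranking by the gain of the ideal ranking. The sketch below is not the project's `evaluate_nDCG`; it assumes the raw relevance grades are used directly as gains with a log2(rank + 1) discount, and `dcg` and `sketch_ndcg` are hypothetical names.

```python
import math

def dcg(gains):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def sketch_ndcg(ranked_doc_ids, gold):
    """nDCG of a ranking against gold judgements given as [(doc_id, grade), ...]."""
    grade = dict(gold)
    # documents without a judgement count as not relevant (gain 0)
    gains = [grade.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_dcg = dcg(sorted(grade.values(), reverse=True))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# toy judgements in the same (docID, grade) format as gold_col
toy_gold = [("MED-1", 2), ("MED-2", 1)]
perfect = sketch_ndcg(["MED-1", "MED-2", "MED-3"], toy_gold)
swapped = sketch_ndcg(["MED-2", "MED-1", "MED-3"], toy_gold)
```

A ranking that lists the documents in order of decreasing grade scores 1.0; placing the lower-graded document first yields a score below 1.0.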

# nDCG Evaluation

# Approach 1: 'vanilla' (using Term-Document Matrix)

### Evaluation - Normalized Discounted Cumulative Gain (nDCG)

In [23]:
# Evaluate nDCG 
nDCG_list = list()

for qIDX, qTEXT in query_col.items():
    query = qTEXT
    gold_list=gold_col[qIDX]
    
    topK_scores = top_k_retrieval(q = query, TDM=tdm, idfDict=idfs,
                        D = doc_collection_test, k = 3162, strategy="vanilla",
                        show_documents=False, print_scores=False,
                       return_results=True, return_speed=False)
    
    qScore = evaluate_nDCG(y_pred=topK_scores, y_true=gold_list, variant="raw_scores")
    nDCG_list.append(qScore)

In [24]:
nDCG_avg = np.mean(nDCG_list)
print("nDCG: {:.4f}".format(nDCG_avg))

nDCG: 0.4749


# Approach 2: 'optimal' (using invertedIndex)

### Evaluation - Normalized Discounted Cumulative Gain (nDCG)

In [25]:
# Evaluate nDCG 
nDCG_list = list()

for qIDX, qTEXT in query_col.items():
    query = qTEXT
    gold_list=gold_col[qIDX]
    
    topK_scores = top_k_retrieval(q = query, D = doc_collection_test, k = 3162,
                                idfDict = idfs, invertedIdx = inverted_index,
                                lengthIdx = doc_lengths,
                                show_documents=False, return_results=True, print_scores=False)
    
    qScore = evaluate_nDCG(y_pred=topK_scores, y_true=gold_list, variant="raw_scores")
    nDCG_list.append(qScore)

In [26]:
nDCG_avg = np.mean(nDCG_list)
print("nDCG: {:.4f}".format(nDCG_avg))

nDCG: 0.4749


# Approach 3: 'postingMerge Intersection'

### Evaluation - Normalized Discounted Cumulative Gain (nDCG)

In [27]:
# Evaluate nDCG 
nDCG_list = list()

for qIDX, qTEXT in query_col.items():
    query = qTEXT
    gold_list=gold_col[qIDX]
    
    topK_scores = top_k_retrieval(q = query, D = doc_collection_test, k = 3162, strategy="intersection",
                                    idfDict = idfs, invertedIdx = inverted_index,
                                    lengthIdx = doc_lengths,
                                    show_documents=False, print_scores=False,
                                    return_results=True, return_speed=False)
    
    qScore = evaluate_nDCG(y_pred=topK_scores, y_true=gold_list, variant="raw_scores")
    nDCG_list.append(qScore)

In [28]:
nDCG_avg = np.mean(nDCG_list)
print("nDCG: {:.4f}".format(nDCG_avg))

nDCG: 0.2993
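
The name of this strategy suggests that candidates are restricted to documents containing every query term. The classic merge of sorted posting lists that such a step relies on can be sketched as follows; this is an illustration under that assumption, not the code in `VSM_functions`, and `intersect` and `intersect_all` are hypothetical names.

```python
def intersect(p1, p2):
    """Merge-intersect two sorted posting lists of document ids."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_all(posting_lists):
    """Intersect several posting lists, shortest first to keep intermediate results small."""
    if not posting_lists:
        return []
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        result = intersect(result, postings)
    return result

# toy posting lists with integer document ids
toy_postings = {"vegetarians": [2, 5, 9], "protein": [1, 2, 5, 7, 9], "enough": [5, 9, 11]}
common = intersect_all(list(toy_postings.values()))
```

With a strict intersection, documents that match only some of the query terms never enter the ranking. That may be one reason for the lower nDCG compared with approaches 1 and 2, although this notebook does not verify it.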


# Approach 4: 'preclustering'

### Evaluation - Normalized Discounted Cumulative Gain (nDCG)

In [29]:
# Evaluate nDCG 
nDCG_list = list()

for qIDX, qTEXT in query_col.items():
    query = qTEXT
    gold_list=gold_col[qIDX]
    
    topK_scores = top_k_retrieval(q = query, D = doc_collection_test, k = 3162, strategy="preclustering",
                                         idfDict = idfs, invertedIdx = inverted_index,
                                         lengthIdx = doc_lengths, preClusterDict=clusters,
                                         show_documents=False, print_scores=False,
                                         return_results=True, return_speed=False)
    
    qScore = evaluate_nDCG(y_pred=topK_scores, y_true=gold_list, variant="raw_scores")
    nDCG_list.append(qScore)

In [30]:
nDCG_avg = np.mean(nDCG_list)
print("nDCG: {:.4f}".format(nDCG_avg))

nDCG: 0.3171


# Approach 5: 'tiered_index' with t = 0.5

### Evaluation - Normalized Discounted Cumulative Gain (nDCG)

In [31]:
# Evaluate nDCG 
nDCG_list = list()

for qIDX, qTEXT in query_col.items():
    query = qTEXT
    gold_list=gold_col[qIDX]
    
    topK_scores = top_k_retrieval(q = query, D = doc_collection_test, k = 3162, strategy="tiered",
                                         idfDict = idfs, invertedIdx = inverted_index,
                                         lengthIdx = doc_lengths, tieredIdx = tiered_index,
                                         show_documents=False, print_scores=False,
                                         return_results=True, return_speed=False)
    
    qScore = evaluate_nDCG(y_pred=topK_scores, y_true=gold_list, variant="raw_scores")
    nDCG_list.append(qScore)

In [32]:
nDCG_avg = np.mean(nDCG_list)
print("nDCG: {:.4f}".format(nDCG_avg))

nDCG: 0.4711


# Approach 5b: 'tiered_index' with t = 0.8

In [33]:
# Construct tiered index
tiered_index = construct_tiered_index(doc_collection_test, inverted_index, t=0.8)

TieredIndex construction done in 0.2104s.


### Evaluation - Normalized Discounted Cumulative Gain (nDCG)

In [34]:
# Evaluate nDCG 
nDCG_list = list()

for qIDX, qTEXT in query_col.items():
    query = qTEXT
    gold_list=gold_col[qIDX]
    
    topK_scores = top_k_retrieval(q = query, D = doc_collection_test, k = 3162, strategy="tiered",
                                         idfDict = idfs, invertedIdx = inverted_index,
                                         lengthIdx = doc_lengths, tieredIdx = tiered_index,
                                         show_documents=False, print_scores=False,
                                         return_results=True, return_speed=False)
    
    qScore = evaluate_nDCG(y_pred=topK_scores, y_true=gold_list, variant="raw_scores")
    nDCG_list.append(qScore)

In [35]:
nDCG_avg = np.mean(nDCG_list)
print("nDCG: {:.4f}".format(nDCG_avg))

nDCG: 0.4626
