In [4]:
import pyndri

index = pyndri.Index('index/')

In [7]:
import collections
import io
import logging
import sys
import time

def parse_topics(file_or_files,
                 max_topics=sys.maxsize, delimiter=';'):
    assert max_topics is None or max_topics >= 0

    topics = collections.OrderedDict()

    if not isinstance(file_or_files, list) and \
            not isinstance(file_or_files, tuple):
        if hasattr(file_or_files, '__iter__'):
            file_or_files = list(file_or_files)
        else:
            file_or_files = [file_or_files]

    for f in file_or_files:
        assert isinstance(f, io.IOBase)

        for line in f:
            assert(isinstance(line, str))

            line = line.strip()

            if not line:
                continue

            topic_id, terms = line.split(delimiter, 1)

            if topic_id in topics and (topics[topic_id] != terms):
                logging.error('Duplicate topic "%s" (%s vs. %s).',
                              topic_id,
                              topics[topic_id],
                              terms)

            topics[topic_id] = terms

            if max_topics and len(topics) >= max_topics:
                break

    return topics

In [8]:
with open('./ap_88_89/topics_title', 'r') as f_topics:
    queries = parse_topics([f_topics])

index = pyndri.Index('index/')

num_documents = index.maximum_document() - index.document_base()

dictionary = pyndri.extract_dictionary(index)

tokenized_queries = {
    query_id: [dictionary.translate_token(token)
               for token in index.tokenize(query_string)
               if dictionary.has_token(token)]
    for query_id, query_string in queries.items()}

query_term_ids = set(
    query_term_id
    for query_term_ids in tokenized_queries.values()
    for query_term_id in query_term_ids)

print('Gathering statistics about', len(query_term_ids), 'terms.')

# inverted index creation.

document_lengths = {}
unique_terms_per_document = {}

inverted_index = collections.defaultdict(dict)
collection_frequencies = collections.defaultdict(int)

total_terms = 0

# Start the timer that is reported after the loop below.
start_time = time.time()

for int_doc_id in range(index.document_base(), index.maximum_document()):
    ext_doc_id, doc_token_ids = index.document(int_doc_id)

    document_bow = collections.Counter(
        token_id for token_id in doc_token_ids
        if token_id > 0)
    document_length = sum(document_bow.values())

    document_lengths[int_doc_id] = document_length
    total_terms += document_length

    unique_terms_per_document[int_doc_id] = len(document_bow)

    for query_term_id in query_term_ids:
        assert query_term_id is not None

        document_term_frequency = document_bow.get(query_term_id, 0)

        if document_term_frequency == 0:
            continue

        collection_frequencies[query_term_id] += document_term_frequency
        inverted_index[query_term_id][int_doc_id] = document_term_frequency

avg_doc_length = total_terms / num_documents

print('Inverted index creation took', time.time() - start_time, 'seconds.')

Gathering statistics about 456 terms.

### Task 2: Latent Semantic Models (LSMs) [20 points] ###

In this task you will experiment with applying distributional semantics methods ([LSI](http://lsa3.colorado.edu/papers/JASIS.lsi.90.pdf) **[5 points]** and [LDA](https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf) **[5 points]**) for retrieval.

You do not need to implement LSI or LDA on your own. Instead, you can use [gensim](http://radimrehurek.com/gensim/index.html). An example of how to integrate Pyndri with Gensim for word2vec can be found [here](https://github.com/cvangysel/pyndri/blob/master/examples/word2vec.py). For the remaining latent vector space models, you will need to implement connector classes (such as `IndriSentences`) yourself.

In order to use a latent semantic model for retrieval, you need to:
   * build a representation of the query **q**,
   * build a representation of the document **d**,
   * calculate the similarity between **q** and **d** (e.g., cosine similarity, KL-divergence).
     
The exact implementation here depends on the latent semantic model you are using. 
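
The two similarity measures named above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the query and the document are already represented as dense vectors of equal length; the function names, the example vectors and the `epsilon` smoothing constant are assumptions made for this sketch and are not part of the assignment.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector is all zeros, so that an empty
    representation never produces a division by zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def kl_divergence(p, q, epsilon=1e-12):
    """KL(p || q) for two discrete distributions over the same topics.

    epsilon is a hypothetical smoothing constant that avoids log(0)
    when q assigns zero probability to a topic.
    """
    return sum(pi * math.log((pi + epsilon) / (qi + epsilon))
               for pi, qi in zip(p, q) if pi > 0.0)

# Illustrative topic distributions for a query and a document.
query_vec = [0.7, 0.2, 0.1]
doc_vec = [0.6, 0.3, 0.1]

print(cosine_similarity(query_vec, doc_vec))  # higher means more similar
print(kl_divergence(query_vec, doc_vec))      # lower means more similar
```

Note the opposite orientation of the two measures: a ranking based on cosine similarity sorts in descending order, whereas a ranking based on KL-divergence sorts in ascending order (or uses the negated divergence as the score).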
   
Each of these LSMs comes with various hyperparameters to tune. Choose the parameters and explicitly state the reasons that led you to these decisions. You can use the validation set to optimize any hyperparameters you see fit; motivate your decisions. In addition, mention clearly how the query/document representations were constructed for each LSM and explain your choices.

In this experiment, you will first obtain an initial top-1000 ranking for each query using TF-IDF in **Task 1**, and then re-rank the documents using the LSMs. Use TREC Eval to obtain the results and report on `NDCG@10`, Mean Average Precision (`MAP@1000`), `Precision@5` and `Recall@1000`.

Perform significance testing **[5 points]** (similar to Task 1) in the class of semantic matching methods.

Perform analysis **[5 points]**

In [22]:
from gensim.models.keyedvectors import KeyedVectors
import pyndri.compat
import gensim
from gensim import corpora, models
import scipy.spatial.distance as ssd
import random
import copy
from scipy import stats
from operator import itemgetter
import numpy as np
import os
from subprocess import Popen, PIPE

In [10]:
# Helper cell that builds two dictionaries for converting between
# internal and external docIDs. Internal docIDs are 1,2,3,...,164597;
# external docIDs have the form APXXXXXX-XXXX

ext2int_ids = {}
int2ext_ids = {}
        
for int_doc_id in range(index.document_base(), index.maximum_document()):
    ext_doc_id, _ = index.document(int_doc_id)
    ext2int_ids[ext_doc_id] = int_doc_id
    int2ext_ids[int_doc_id] = ext_doc_id
    
#print (int2ext_ids[1])
#print (ext2int_ids['AP890425-0001'])

## RETRIEVING THE MODEL

In [11]:
# Training is done by creating a very large BOW representation for all the documents in the collection
# These models are saved, so training has to be done only once. 

In [12]:
# A function that returns a BOW representation of the entire document set (160k documents)
# It loops through all the documents, and appends them to a list
# RETURNS
# - bow representation of the corpus
# - token2id, a dictionary for converting tokens (words) to ids
# - id2token, another dictionary for converting ids to words (tokens)

def document_bow():
    dictionary = pyndri.extract_dictionary(index) 
    token2id, id2token, _ = index.get_dictionary() # Only id2token is necessary
    documents_list = [] # The list that all the documents will be appended to
    
    for i in range(index.document_base(), index.maximum_document()):
        _ , doc = index.document(i)
        doc = [id2token[word_id] for word_id in doc if word_id > 0]
        documents_list.append(doc)
        
    bow_corpus = [dictionary.doc2bow(text) for text in documents_list]
    return bow_corpus, token2id, id2token

corpus, token2id, id2token = document_bow()

In [None]:
# Train LDA-model
num_topics = 20
lda20 = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=num_topics, id2word = id2token, passes=10)
lda20.save('LDAmodels/LDA20')

In [None]:
# Train LSI-model

num_topics = 250
lsi250 = gensim.models.lsimodel.LsiModel(corpus=corpus, id2word=id2token, num_topics=num_topics)
lsi250.save('LSImodels/LSI250')
'''
num_topics = 500
lsi500 = gensim.models.lsimodel.LsiModel(corpus=corpus, id2word=id2token, num_topics=num_topics)
lsi500.save('LSImodels/LSI500')

num_topics = 1000
lsi1000 = gensim.models.lsimodel.LsiModel(corpus=corpus, id2word=id2token, num_topics=num_topics)
lsi1000.save('LSImodels/LSI1000')
'''

## PREPROCESSING

In [13]:
# A function that returns a document given an external doc_id
# INPUT
# - an external doc_id, a string, for instance: APXXXXX-XX
# RETURNS
# - a list of strings representing a text, for instance ['joris','is','de','beste']
def get_document(doc_id):
    int_doc_id = ext2int_ids[doc_id]
    _,text_ids = index.document(int_doc_id)
    return [id2token[word_id] for word_id in text_ids if word_id > 0]       

In [14]:
# Function that converts a text into a list of token ids
# INPUT
# - a list of strings, for instance ['python','wizard']
# RETURNS
# - a list of IDS, for instance [1,2] where 'python' maps to 1, and 'wizard' maps to 2
# according to the token2id dictionary
def text_to_ind(text):
    token_ids = [token2id[token] for token in text if token in token2id] 
    return token_ids

In [15]:
# A function that generates the topic vector from a text. This can be a document or a query
# INPUT
# - a text, a list of strings, for instance ['python','wizard']
# - a model, either LSI or LDA model
# RETURNS 
# - a topic scores vector, with scores
def topic_vector_from_text(text,model):
    # Word order is irrelevant here: the Counter below reduces the text to a bag of words
    bow_ids = text_to_ind(text)
    bow_ids_counter = collections.Counter(bow_ids)
    bow_list = [[key,value] for key,value in bow_ids_counter.items()]
    topics_scores = model[bow_list] # topics_scores is of type: [ (1,score1),(2,score2),etc,etc]
    
    topic_ids = [topic_id for topic_id,score in topics_scores]
    topic_scores = [score for topic_id,score in topics_scores]
    num_topics = model.num_topics
    
    scores = np.zeros(num_topics)
    scores[topic_ids] = topic_scores
    return scores
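
The zero-filling step in `topic_vector_from_text` matters because the model returns a sparse list of `(topic_id, score)` pairs, and an LDA model in gensim leaves out topics whose probability falls below its minimum threshold. The sketch below shows the same conversion in plain Python on made-up values; `to_dense` is a hypothetical helper name, not a gensim function.

```python
def to_dense(sparse_topics, num_topics):
    """Expand [(topic_id, score), ...] into a list of length num_topics.

    Topics that are missing from the sparse list get a score of 0.0,
    so every text ends up with a vector of the same length.
    """
    dense = [0.0] * num_topics
    for topic_id, score in sparse_topics:
        dense[topic_id] = score
    return dense

# Illustrative sparse output for a model with 5 topics.
sparse = [(0, 0.25), (3, 0.75)]
print(to_dense(sparse, 5))
```

Without this step, two texts with different sets of reported topics would yield vectors of different lengths, and a distance between them would not be defined.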

In [16]:
# Function that calculates the cosine distance between two vectors
# (ssd.cosine returns 1 - cosine similarity, so lower means more similar)
# INPUT
# - query_topic_vec, the query topic vector
# - doc_topic_vec, the document topic vector
# RETURNS
# - the cosine distance between query_topic_vec and doc_topic_vec, a float
def check_similarity(query_topic_vec,doc_topic_vec):
    return ssd.cosine(query_topic_vec,doc_topic_vec)    

In [17]:
# Function that scores a list of document-ids based on a LSI or LDA model
# INPUT
# - a model, this can be either a LSI or LDA model
# - a query, this is a list of strings, in the form ['python','wizard']
# - a list of doc_ids, of the form [APXXXXX-XX,APYYYYY-YY, etc, etc]. This list is probably of length 1000
# RETURNS
# - a list of the 1000 documents, a list of form [(APYYYYY-YY,scoreY),(APXXXXX-XX,scoreX),(etc,score_etc),...],
# , where the first element of the list is the best query result, and the last element of the list is the worst 
def lsi_lda_model_score(model,query,doc_ids):
    query_topic_vec = topic_vector_from_text(query,model)    
    
    ranking = []
    
    for ext_doc_ID in doc_ids:
        doc = get_document(ext_doc_ID) # returns the document in text
        doc_topic_vec = topic_vector_from_text(doc,model)
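        # check_similarity returns a cosine distance; convert it to a similarity
        # so that a higher score means a better match, which is how trec_eval
        # interprets the score column
        score = 1.0 - check_similarity(query_topic_vec,doc_topic_vec)
        ranking.append((ext_doc_ID,score))
        
    ranking.sort(key=itemgetter(1), reverse=True) # best match first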
    
    return ranking

In [18]:
# A function that reads the tfidf.run file
# INPUT
# - filename, a string
# RETURNS
# - a dictionary that maps a query_id (51-200) to a list of up to 1000 documents [APXXXX-XX,etc,...]
def read_tfidf_file(filename):
    
    # Start with an empty list for every query in valid_query_ids
    # (valid_query_ids has to be defined before this function is called)
    return_dict = {int(qid): [] for qid in valid_query_ids}
    
    with open(filename,'r') as fn:
        for line in fn:
            fields = line.split()
            query_id = int(fields[0])
            ext_doc_ID = fields[2]
            return_dict[query_id].append(ext_doc_ID)
    return return_dict

In [19]:
# A function that appends data to a big data list
# INPUT
# - a list to_be_added, of the form [(APYYYYY-YY,scoreY),(APXXXXX-XX,scoreX),(etc,score_etc),...]
# - query_id, an integer
# - model_name, a str with the model name
# - data_list, a list of lists to which the rows built from to_be_added are appended
# RETURNS
# - data_list, a list of lists. Each inner list is of the form [query_id,'Q0',ext_doc_ID,rank,score,model_name]
def append_data(to_be_added,query_id,model_name,data_list):
    for rank, (ext_doc_ID, score) in enumerate(to_be_added, start=1):
        data_list.append([query_id, 'Q0', ext_doc_ID, rank, score, model_name])
    return data_list

In [20]:
# Function that writes the data to a file
# INPUT
# - model_name, a string of the desired name of the output file
# - data, a list of lists. Each inner list is of the form [queryID,'Q0',ext_doc_ID,rank,score,model_name]
# RETURNS
# - nothing; writes data to a file in the same folder as the .ipynb notebook
def write_model(model_name,data):
    with open(model_name,'w') as mn: # the with-block closes the file
        for row in data:
            mn.write(' '.join(str(term) for term in row) + '\n')
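
For reference, each row of a run file has six whitespace-separated columns: query id, the literal `Q0`, external document id, rank, score and a run tag. The sketch below formats one query's ranking that way. It assumes higher scores mean better matches; `format_run_lines` is a hypothetical helper, and the second document id and both scores are made up for illustration.

```python
def format_run_lines(query_id, scored_docs, run_tag):
    """Turn (ext_doc_id, score) pairs into TREC run-file lines.

    The pairs are sorted by descending score first, so the rank column
    agrees with the score column.
    """
    ordered = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return ['{} Q0 {} {} {:.6f} {}'.format(query_id, doc_id, rank, score, run_tag)
            for rank, (doc_id, score) in enumerate(ordered, start=1)]

lines = format_run_lines(
    51,
    [('AP890425-0001', 0.42), ('AP880212-0047', 0.91)],  # illustrative scores
    'LDA20')

for line in lines:
    print(line)
```

Keeping rank and score consistent matters because trec_eval orders the documents of a query by the score column rather than by the rank column, so a distance-like score (lower is better) would be evaluated in reverse order.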

In [21]:
lda_model = gensim.models.ldamodel.LdaModel.load('LDAmodels/LDA20')
lsi_model = gensim.models.lsimodel.LsiModel.load('LSImodels/LSI100')

In [None]:
query_return_dict = read_tfidf_file('tfidf_valid.run')
#lsi_model = gensim.models.lsimodel.LsiModel.load('LSImodels/LSI250') 
lda_model = gensim.models.ldamodel.LdaModel.load('LDAmodels/LDA20')
model_name = 'LDA20.run'
data = []

for query,values in query_return_dict.items():
    query_in_text = queries[str(query)].lower().split()

    new_ranking = lsi_lda_model_score(lda_model,query_in_text,query_return_dict[query])
    data = append_data(new_ranking,int(query),model_name,data)
    
write_model(model_name,data)
print(model_name + ' file created!')

## ANALYSIS

In [None]:
def write_results(model_name):
    '''
    A function that writes the evaluation results to a text file
    INPUT
    - a model_name, this can be LSI50, LSI100, LSI250, LDA10 or LDA20 or another string
    RETURNS
    - nothing; writes a .txt file in the folder Results. Each line contains the measure, the query id, and the score
    '''
    output_file_name = 'Results/results'+model_name+'.txt'
    
    # Run the evaluation script for this model and capture its output
    command = './eval'+model_name + '.sh'
    proc = Popen(command,shell=True,stdout = PIPE)
    out,err = proc.communicate()
    
    with open(output_file_name,'w') as file: # the with-block closes the file
        for result in out.decode('utf-8').split('\n'):
            # Replace the tabs in the evaluation output by single spaces
            file.write(' '.join(result.split('\t')) + '\n')
                
write_results('LSI50')
write_results('LSI100')
write_results('LSI250')
write_results('LDA10')
write_results('LDA20')

In [None]:
def read_metric_scores(results_file_name):
    '''
    Reads a results file written by write_results and collects the per-query scores
    INPUT
    - results_file_name, a string
    RETURNS
    - four lists of floats: the MAP, P@5, recall@1000 and NDCG@10 scores per query
    '''
    metrics = {'map': [], 'P_5': [], 'recall_1000': [], 'ndcg_cut_10': []}
    
    with open(results_file_name,'r') as results_file:
        for line in results_file:
            result_line = line.split()
            # Skip blank lines and the 'all' rows, which aggregate over the queries
            if len(result_line) < 3 or result_line[1] == 'all':
                continue
            if result_line[0] in metrics:
                metrics[result_line[0]].append(float(result_line[-1]))
    
    return (metrics['map'], metrics['P_5'],
            metrics['recall_1000'], metrics['ndcg_cut_10'])

(map_scores_lsi50, p5_scores_lsi50,
 recall1000_scores_lsi50, ndcg10_scores_lsi50) = read_metric_scores('Results/resultsLSI50.txt')

(map_scores_lsi250, p5_scores_lsi250,
 recall1000_scores_lsi250, ndcg10_scores_lsi250) = read_metric_scores('Results/resultsLSI250.txt')

(map_scores_lda20, p5_scores_lda20,
 recall1000_scores_lda20, ndcg10_scores_lda20) = read_metric_scores('Results/resultsLDA20.txt')
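
With per-query scores collected for each model, two models can be compared with a paired significance test. The sketch below uses a two-sided paired randomization (sign-flip) test written in plain Python. This is one common choice in IR evaluation and not necessarily the test used in Task 1; the function name, the number of samples, the seed and the example scores are all assumptions made for this illustration.

```python
import random

def paired_randomization_test(scores_a, scores_b, num_samples=10000, seed=0):
    """Estimate a two-sided p-value for the difference between two systems.

    scores_a and scores_b must hold per-query scores in the same query
    order. Under the null hypothesis the sign of each per-query
    difference is arbitrary, so signs are flipped at random and the
    resulting total difference is compared with the observed one.
    """
    assert len(scores_a) == len(scores_b)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    at_least_as_extreme = 0
    for _ in range(num_samples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / num_samples

# Made-up per-query MAP scores for two hypothetical systems.
system_a = [0.31, 0.12, 0.45, 0.27, 0.08, 0.52, 0.19, 0.33]
system_b = [0.28, 0.10, 0.41, 0.29, 0.05, 0.47, 0.15, 0.30]

print(paired_randomization_test(system_a, system_b))
```

In this notebook the call would look like `paired_randomization_test(map_scores_lsi250, map_scores_lda20)`, provided both lists cover the same queries in the same order.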

### Task 3:  Word embeddings for ranking [10 points] ###

First create word embeddings on the corpus we provided using [word2vec](http://arxiv.org/abs/1411.2738) -- [gensim implementation](https://radimrehurek.com/gensim/models/word2vec.html). You should extract the indexed documents using pyndri and provide them to gensim for training a model (see example [here](https://github.com/nickvosk/pyndri/blob/master/examples/word2vec.py)).

Try one of the following (increasingly complex) methods for building query and document representations:
   * Average or sum the word vectors.
   * Cluster words in the document using [k-means](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) and use the centroid of the most important cluster. Experiment with different values of K for k-means.
   * Use the [bag-of-word-embeddings representation](https://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1248).
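
The simplest of these options, averaging the word vectors, can be sketched as follows. The example uses a toy embedding table in plain Python; in practice the vectors would come from the trained word2vec model. The function name, the two-dimensional vectors and the handling of out-of-vocabulary tokens (they are skipped) are assumptions made for this sketch.

```python
def average_vector(tokens, embeddings, dim):
    """Average the embeddings of all tokens that have a vector.

    Tokens without an embedding are skipped; if none of the tokens has
    one, a zero vector of length dim is returned.
    """
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]

# Toy embedding table with made-up two-dimensional vectors.
toy_embeddings = {
    'python': [1.0, 0.0],
    'wizard': [0.0, 1.0],
}

query_vector = average_vector(['python', 'wizard', 'unknown'], toy_embeddings, 2)
print(query_vector)
```

The same function can build the document representation, after which query and document are compared with a vector similarity such as the cosine.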
   
Note that since we provide the implementation for training word2vec, you will be graded based on your creativity on combining word embeddings for building query and document representations.

Note: If you want to experiment with pre-trained word embeddings on a different corpus, you can use the word embeddings we provide alongside the assignment (./data/reduced_vectors_google.txt). These are the [google word2vec word embeddings](https://code.google.com/archive/p/word2vec/), reduced to only the words that appear in the document collection we use in this assignment.

### Task 4: Learning to rank (LTR) [10 points] ###

In this task you will get an introduction into learning to rank for information retrieval, in particular pointwise learning to rank.

You will experiment with a pointwise learning to rank method, logistic regression, implemented in [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
Train your LTR model using 10-fold cross validation on the test set.
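
One way to set up the folds is to split at the query level, so that all candidate documents of a query end up in the same fold. The assignment does not prescribe this; it is an assumption of the sketch below, as are the function name and the range of query ids.

```python
def query_folds(query_ids, num_folds=10):
    """Distribute query ids over num_folds folds in round-robin order."""
    ordered = sorted(query_ids)
    return [ordered[i::num_folds] for i in range(num_folds)]

# Illustrative query ids 51..80, split into 10 folds of 3 queries each.
folds = query_folds(range(51, 81), num_folds=10)

for held_out_index, held_out in enumerate(folds):
    # Train on the queries of all other folds, evaluate on the held-out fold.
    training_queries = [qid
                        for index, fold in enumerate(folds)
                        if index != held_out_index
                        for qid in fold]
    print(held_out_index, held_out, len(training_queries))
```

Splitting by query rather than by individual query-document pair avoids training and evaluating on documents that were retrieved for the same query.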

You can explore different ways for devising features for the model. Obviously, you can use the retrieval methods you implemented in Task 1 and Task 2 as features. Think about other features you can use (e.g. query/document length). 
One idea is to also explore external sources such as Wikipedia entities (?). Creativity on devising new features and providing motivation for them will be taken into account when grading.

For every query, first create a document candidate set using the top-1000 documents using TF-IDF, and subsequently compute features given a query and a document. Note that the feature values of different retrieval methods are likely to be distributed differently.
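
Because the features are on different scales, one option is to rescale each feature within the candidate set of a query before training. The sketch below applies min-max normalization in plain Python; the choice of min-max (rather than, say, z-scores), the function name and the feature values are assumptions made for this illustration.

```python
def min_max_normalize(values):
    """Rescale a list of feature values to the range [0, 1].

    If all values are equal, the feature carries no information for
    this query and every value is mapped to 0.0.
    """
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]

# Made-up feature values for three candidate documents of one query.
tfidf_scores = [2.0, 4.0, 6.0]
lda_scores = [0.10, 0.10, 0.10]

print(min_max_normalize(tfidf_scores))
print(min_max_normalize(lda_scores))
```

Normalizing per query keeps a feature with a large numeric range from dominating the logistic regression weights purely because of its scale.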

### Task 5: Write a report [20 points; instant FAIL if not provided] ###

The report should be a PDF file created using the [sigconf ACM template](https://www.acm.org/publications/proceedings-template) and will determine a significant part of your grade.

   * It should explain what you have implemented, motivate your experiments and detail what you expect to learn from them. **[10 points]**
   * Lastly, provide a convincing analysis of your results and conclude the report accordingly. **[10 points]**
      * Do all methods perform similarly on all queries? Why?
      * Is there a single retrieval model that outperforms all other retrieval models (i.e., silver bullet)?
      * ...

**Hand in the report and your self-contained implementation source files.** Only send us the files that matter, organized in a well-documented zip/tgz file with clear instructions on how to reproduce your results. That is, we want to be able to regenerate all your results with minimal effort. You can assume that the index and ground-truth information is present in the same file structure as the one we have provided.
