## CSC 575 HW#4

https://nbviewer.jupyter.org/url/condor.depaul.edu/ntomuro/courses/575/assign/575hw4.ipynb

<p><strong>Overview</strong>:</p>
<p>Implement the 'Inverted Index Retrieval Algorithm' (in
<a href="http://condor.depaul.edu/ntomuro/courses/575/notes/VS-Retrieval.pptx">
lecture note (#6)</a>) and the evaluation metric Mean Average Precision (MAP)
(in <a href="http://condor.depaul.edu/ntomuro/courses/575/notes/Evaluation.pptx">
lecture note (#7)</a>), and apply them to a corpus called the <a href="http://ir.dcs.gla.ac.uk/resources/test_collections/">Medline
collection</a>.
</p>
<p>The Medline
collection is one of the standard Information Retrieval (IR) test
collections, which many researchers have used as a benchmark to evaluate IR
systems. It contains 1033 documents (abstracts of papers published on
Medline), 30 queries, and relevance judgments for all query-document pairs.
</p>

### Programming: Vector-space Retrieval & Evaluation -- Partially filled code

### (1) Step 1: Load Inverted Index (H) and compute DocLen (DL).

In [3]:
import csv
import math

tindexfile = 'medline_term_index.csv'
invindexfile = 'medline_inverted_index.csv'
dindexfile = 'medline_doc_index.csv'

# Number of documents in the corpus (hard-coded for this corpus)
N = 1033

# Major data structures
H_invindex = {} # inverted index; term -> (idf, L:hashmap of (docID . tf))
DL_doclen = {}  # document lengths; docID -> len

## (1) Read the term index file and populate the invindex first
tid2term_map = {} # temporary storage to hold mappings of termID -> term

fin = open(tindexfile, 'r', encoding='utf-8')
reader = csv.reader(fin, delimiter='\t')
for line in reader:
    term = line[0]    # term string
    termID = line[1]  # termID
    df = int(line[2]) # document frequency
    idf = math.log10(N/df) # idf
    # record term -> (idf, emptyL) in H
    H_invindex[term] = (idf, dict())
    # record termID -> term 
    tid2term_map[termID] = term 
fin.close()

## (2) Read the inverted index file and add postings lists in H.
## Also compute document lengths too, incrementally -- and record in DL.
fin = open(invindexfile, 'r')
reader = csv.reader(fin, delimiter='\t')
for line in reader:
    termID = line[0]
    idx = 1
    while idx < (len(line)-1):
        docID = line[idx]
        tf = int(line[idx+1]) # raw tf of the term in this document
        # Record docID -> tf in term's L
        L = (H_invindex[tid2term_map[termID]])[1]
        L[docID] = tf  # docID -> raw term frequency
        
        # Accumulate the component vector length for the document
        tfidf = tf * (H_invindex[tid2term_map[termID]])[0] # tf * idf
        tfidfsq = math.pow(tfidf, 2.0)
        if docID in DL_doclen:
            DL_doclen[docID] += tfidfsq
        else:
            DL_doclen[docID] = tfidfsq
        #
        idx += 2
fin.close()

# Fix the DL entries by applying sqrt to make vector length.
for docID in DL_doclen.keys():
    val = DL_doclen[docID]
    DL_doclen[docID] = math.sqrt(val)

    
print ('Total # terms: %d' % len(H_invindex))
for term in ['pentobarbit', 'defici', 'treatment']:
    print (' - Entry for \'%s\': df=%s, idf=%s' % (term, len(H_invindex[term][1]), H_invindex[term][0]))

print ('\nTotal # documents: %d' % len(DL_doclen))
for docID in ['59', '1033']:
    print (' - Vector len for Doc %s = %s' % (docID, DL_doclen[docID]))


Total # terms: 11463
 - Entry for 'pentobarbit': df=4, idf=2.412040330191658
 - Entry for 'defici': df=39, idf=1.4230357144931214
 - Entry for 'treatment': df=172, idf=0.7785718746120717

Total # documents: 1033
 - Vector len for Doc 59 = 13.811725366348801
 - Vector len for Doc 1033 = 31.163653356034512
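
The idf values printed above can be checked by hand, since Step 1 uses idf = log10(N/df) with N = 1033. The following is a small sketch of that arithmetic; the helper names `idf` and `vector_length` and the toy document are illustrative assumptions, not part of the assignment code.

```python
import math

N = 1033  # corpus size, hard-coded as in Step 1

def idf(df, n_docs=N):
    """Inverse document frequency with log base 10, as used in Step 1."""
    return math.log10(n_docs / df)

def vector_length(weights):
    """Euclidean length of a document vector given its tf*idf weights."""
    return math.sqrt(sum(w * w for w in weights))

# df values taken from the printed output above
idf_pentobarbit = idf(4)
idf_treatment = idf(172)

# Hypothetical toy document: 'pentobarbit' once, 'treatment' three times
toy_weights = [1 * idf_pentobarbit, 3 * idf_treatment]
toy_length = vector_length(toy_weights)

print(idf_pentobarbit, idf_treatment, toy_length)
```

The same sum-of-squares-then-sqrt pattern is what the loop over the inverted index file accumulates incrementally in `DL_doclen`.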


### (2) Step 2: Queries as Vectors

In [4]:
import re
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

queryfile = 'medline.query'

# A list of queries. Each Query is a tuple, (qID, Q:term->tf map)
Queries_list = []

porter = nltk.PorterStemmer()
stop_words = set(stopwords.words('english'))  # build once, not once per token

# (alternative encoding tried: 'iso-8859-1')
with open(queryfile, 'r', encoding='utf-8') as fin:
    for line in fin:
        matchObj = re.match(r'^(\d+)\s+(.*)', line)
        if not matchObj:
            print ("ERROR with line -- %s" % line)
            continue
        queryID = matchObj.group(1) # queryID
        text = matchObj.group(2)    # query string (ignoring sentences)

        # process text string -- same processing as the one applied to documents.
        tokens = word_tokenize(text.lower())
        terms = [porter.stem(w) for w in tokens if w not in stop_words and len(w) > 1] # (**)
        # term frequencies of the terms in this query are obtained by NLTK's FreqDist
        fdist = nltk.FreqDist(terms)
        # append the Query to the list
        Queries_list.append((queryID, dict(fdist)))

print ('Total # queries: %d' % len(Queries_list))
for pos in [1, 21]:  # list positions (0-based), i.e. Queries 2 and 22
    print (' - Query %s: %s' % (Queries_list[pos][0], Queries_list[pos][1]))

Total # queries: 30
 - Query 2: {'relationship': 1, 'blood': 1, 'cerebrospin': 1, 'fluid': 1, 'oxygen': 1, 'concentr': 1, 'partial': 1, 'pressur': 1, 'method': 1, 'interest': 1, 'polarographi': 1}
 - Query 22: {'mycoplasma': 1, 'infect': 1, 'presenc': 1, 'embryo': 1, 'fetu': 1, 'newborn': 1, 'infant': 1, 'anim': 1, 'pregnanc': 1, 'gynecolog': 1, 'diseas': 1, 'relat': 1, 'chromosom': 2, 'abnorm': 1}
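
To treat a query as a vector, each term count can be weighted by the idf stored in the index. The sketch below is one way to do that under stated assumptions: the function name `query_weights` and the toy index are hypothetical, the index is assumed to have the same shape as `H_invindex` (term -> (idf, postings)), and query terms missing from the index are simply skipped.

```python
def query_weights(query_tf, inv_index):
    """Map query term counts to tf*idf weights, skipping unknown terms."""
    weights = {}
    for term, tf in query_tf.items():
        entry = inv_index.get(term)
        if entry is None:
            continue  # term never occurs in the corpus, so it cannot match
        term_idf, _postings = entry
        weights[term] = tf * term_idf
    return weights

# Toy index with made-up idf values, shaped like H_invindex
toy_index = {
    'chromosom': (1.2, {'7': 2, '9': 1}),
    'abnorm': (0.9, {'7': 1}),
}
toy_query = {'chromosom': 2, 'abnorm': 1, 'unseenterm': 1}

q_w = query_weights(toy_query, toy_index)
print(q_w)
```

Keeping the weights in a dict keyed by term avoids having to keep query and document components in the same positional order.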


In [5]:
print ('Total # terms: %d' % len(H_invindex))
for term in ['pentobarbit']:
    print (' - Entry for \'%s\': df=%s, idf=%s' % (term, len(H_invindex[term][1]), H_invindex[term][0]))
    print(H_invindex[term])
    print()

print ('\nTotal # documents: %d' % len(DL_doclen))
for docID in ['59', '1033']:
    print (' - Vector len for Doc %s = %s' % (docID, DL_doclen[docID]))

Total # terms: 11463
 - Entry for 'pentobarbit': df=4, idf=2.412040330191658
(2.412040330191658, {'187': 1, '301': 1, '416': 1, '419': 1})


Total # documents: 1033
 - Vector len for Doc 59 = 13.811725366348801
 - Vector len for Doc 1033 = 31.163653356034512


## (3) Retrieval and Ranking -- Steps 3, 4, 5 for each Query

### - For each query, follow Steps 3, 4, and 5 of the Vector Space Retrieval Algorithm to obtain a ranked list of retrieved documents (sorted by the cosine measure) -- 'R' in the algorithm.
### - Then save the ranked lists in a list (in the same order as the queries) -- to be used in the next step, Evaluation.
### - (**) Also, WRITE the ranked lists to an output file.  See the homework page for details.
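
The steps above can be sketched with per-document score accumulators. This is a minimal sketch and may differ in detail from the lecture's algorithm: the function name `retrieve` and the toy data are made up for illustration, and the inputs are assumed to have the same shapes as `H_invindex` (term -> (idf, {docID: tf})) and `DL_doclen` (docID -> vector length).

```python
import math

def retrieve(query_tf, inv_index, doc_len):
    """Rank documents by cosine similarity to the query via an inverted index."""
    scores = {}  # docID -> accumulated dot product with the query
    q_sq = 0.0   # running sum of squared query weights
    for term, q_tf in query_tf.items():
        if term not in inv_index:
            continue  # unknown term contributes nothing
        term_idf, postings = inv_index[term]
        q_weight = q_tf * term_idf
        q_sq += q_weight * q_weight
        for doc_id, tf in postings.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + q_weight * (tf * term_idf)
    q_len = math.sqrt(q_sq)
    if q_len == 0.0:
        return []
    # Normalize by both vector lengths, then sort by descending cosine
    ranked = [(d, s / (q_len * doc_len[d]))
              for d, s in scores.items() if doc_len[d] > 0]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked

# Toy index: 'a' occurs in d1 and d2, 'b' only in d1 (idf values made up)
toy_index = {'a': (1.0, {'d1': 1, 'd2': 1}), 'b': (2.0, {'d1': 1})}
toy_doc_len = {'d1': math.sqrt(1.0 + 4.0), 'd2': 1.0}

ranked = retrieve({'a': 1}, toy_index, toy_doc_len)
print(ranked)
```

Only documents that share at least one term with the query are ever touched, which is the point of using the inverted index instead of building a dense term-by-document matrix per query.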

In [6]:
from itertools import islice

# Peek at the first six entries of the inverted index
for key in islice(H_invindex, 6):
    print(key+' '+str(H_invindex[key]))



'' (1.4230357144931214, {'497': 1, '545': 1, '556': 1, '560': 1, '565': 1, '592': 2, '602': 2, '604': 4, '606': 1, '608': 6, '611': 1, '612': 2, '613': 1, '615': 2, '624': 1, '625': 1, '630': 2, '633': 2, '640': 1, '647': 1, '660': 1, '678': 1, '691': 1, '699': 1, '709': 1, '713': 1, '715': 2, '720': 2, '738': 1, '740': 3, '743': 1, '750': 1, '752': 1, '789': 1, '798': 1, '801': 1, '807': 1, '808': 3, '817': 2})
'a (2.7130703258556395, {'421': 1, '609': 1})
'achondroplast (3.0141003215196207, {'576': 1})
'adequ (3.0141003215196207, {'421': 1})
'agnos (3.0141003215196207, {'358': 2})
'air (3.0141003215196207, {'70': 1})


In [29]:
# RETRIEVE IDF IN QUERY
from collections import defaultdict
from scipy.spatial import distance
import numpy as np
import pandas as pd
docs_tfidf = defaultdict(list)
i=0
for q in Queries_list:
#     doc_w_terms=defaultdict(list)
    q_tfidf=[]
    doc_w_terms=[]
    print(sorted(q[1]))
    for token in sorted(q[1].keys()): # ITERATE OVER QUERY TERMS
#         print("QUERY TERM: "+token+' '+str(H_invindex[token][0])+'x'+str(q[1][token]))
        qt_tfidf=H_invindex[token][0]*q[1][token]
        q_tfidf.append(qt_tfidf)
#         print("DOCS: ",end="")
#         print(H_invindex[token][1])
        for docid in H_invindex[token][1].keys():
#             print('\t'+str(docid)+': '+str(H_invindex[token][1][docid])+' x '+str(H_invindex[token][0])+' = '
#                   +str(H_invindex[token][1][docid]*H_invindex[token][0]))
            # TODO: build list/dict of this info for all docs having this term
            docs_tfidf.setdefault(token,[]).append({docid:H_invindex[token][1][docid]*H_invindex[token][0]})
#             if (docid not in doc_w_terms[q[0]]):
#                 doc_w_terms.setdefault(q[0],[]).append(docid)
            if(docid not in doc_w_terms):
                doc_w_terms.append(docid)

    print('transformed query: '+str(q_tfidf))
#     print('docs tfidf: ')
#     print(docs_tfidf)
    # CREATE VECTORS OF TFIDFs
    query_docs=np.zeros([len(q[1].keys()),len(doc_w_terms)])
    query_docs=pd.DataFrame(data=query_docs,columns=doc_w_terms)
    print(query_docs.shape)
#     print(sorted(docs_tfidf))
#     print(query_docs)
    
#     print(query_docs.shape)
#     print('PRINTING LEN')
#     print(docs_tfidf['len'])
#     for docid in doc_w_terms:
    j=0
    for token in sorted(q[1].keys()):
        for tup in docs_tfidf[token]:
#             print([k for k,v in tup.items()][0])
            for k,v in tup.items():
                query_docs.loc[j, k] = v  # row j = query term, column k = docID; avoids chained assignment
        j+=1

    print(query_docs)
    
    query_docs = query_docs.transpose()
#     query_docs = np.array(query_docs)
#     print(query_docs)
    scores=[]
    for index,row in query_docs.iterrows():
        # distance.cosine returns the cosine *distance*; similarity = 1 - distance
        scores.append(1 - distance.cosine(row, q_tfidf))
#     print(np.sort([doc_w_terms,scores]))
    
    i+=1
    if(i==1):break
query_docs.to_csv('text.csv')

print()
for doc in docs_tfidf:
    print(docs_tfidf[doc])
    print()

['crystallin', 'human', 'includ', 'len', 'vertebr']
transformed query: [2.169002281505364, 0.957195470183148, 1.1390390581279206, 1.401316464799885, 2.412040330191658]
(5, 223)
         72       175       181       336        500       509       549  \
0  8.676009  2.169002  2.169002  2.169002  10.845011  2.169002  2.169002   
1  0.000000  0.000000  0.000000  0.000000   0.000000  0.000000  0.000000   
2  0.000000  0.000000  0.000000  0.000000   0.000000  0.000000  0.000000   
3  4.203949  0.000000  4.203949  0.000000   4.203949  2.802633  0.000000   
4  0.000000  0.000000  0.000000  0.000000   0.000000  0.000000  0.000000   

          9        11        41    ...           506       507       508  \
0  0.000000  0.000000  0.000000    ...      0.000000  0.000000  0.000000   
1  1.914391  2.871586  2.871586    ...      0.000000  0.000000  0.000000   
2  0.000000  0.000000  1.139039    ...      0.000000  0.000000  0.000000   
3  0.000000  0.000000  0.000000    ...      2.802633  1.401316

In [11]:
q = np.array(q_tfidf)
testdoc = np.array(query_docs[0:1][:])[0]
print(Queries_list[0])
print(q)
print(testdoc)
distance.cosine(testdoc,q)
# print(distance.cosine(np.array(query_docs[0:1][:])[0],np.array(q_tdidf)))
docs_tfidf
####################################
# PROBLEM WAS WRONG ORDER OF KEY IN DEFAULT DICT docs_tfidf

('1', {'crystallin': 1, 'len': 1, 'vertebr': 1, 'includ': 1, 'human': 1})
[2.16900228 0.95719547 1.13903906 1.40131646 2.41204033]
[8.67600913 4.20394939 0.         0.         0.        ]


defaultdict(list,
            {'crystallin': [{'72': 8.676009126021455},
              {'175': 2.169002281505364},
              {'181': 2.169002281505364},
              {'336': 2.169002281505364},
              {'500': 10.845011407526819},
              {'509': 2.169002281505364},
              {'549': 2.169002281505364}],
             'human': [{'9': 1.914390940366296},
              {'11': 2.871586410549444},
              {'41': 2.871586410549444},
              {'42': 0.957195470183148},
              {'58': 4.78597735091574},
              {'65': 4.78597735091574},
              {'67': 0.957195470183148},
              {'70': 0.957195470183148},
              {'78': 1.914390940366296},
              {'82': 0.957195470183148},
              {'83': 0.957195470183148},
              {'87': 0.957195470183148},
              {'95': 0.957195470183148},
              {'99': 3.828781880732592},
              {'108': 0.957195470183148},
              {'125': 0.957195470183148},
         

## (4) Evaluation -- Compute MAP
### - Read the relevance judgments (answers) from the file "medline.rel".
### - Compare the ranked lists with the answers, and compute the MAP score.
### - (**) Also print the MAP score (to the terminal).
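
The MAP computation can be sketched as follows. This is an illustration under assumptions, not the assignment's reference solution: the function names and toy data are hypothetical, the format of "medline.rel" is not shown here so the judgments are given as an in-memory dict, and average precision is normalized by the total number of relevant documents for the query (the lecture note may use a different convention).

```python
def average_precision(ranked_doc_ids, relevant):
    """Sum precision at each rank where a relevant doc appears, divided by the
    total number of relevant docs for the query."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(rankings, judgments):
    """Mean of per-query average precision; rankings maps qID -> ranked docIDs."""
    aps = [average_precision(docs, judgments.get(qid, set()))
           for qid, docs in rankings.items()]
    return sum(aps) / len(aps) if aps else 0.0

# Toy data: two queries with made-up rankings and judgments
toy_rankings = {'1': ['d3', 'd1', 'd2'], '2': ['d5', 'd4']}
toy_judgments = {'1': {'d1', 'd2'}, '2': {'d5'}}

map_score = mean_average_precision(toy_rankings, toy_judgments)
print('MAP: %s' % map_score)
```

For query '1' the relevant documents appear at ranks 2 and 3, giving (1/2 + 2/3) / 2 = 7/12; query '2' scores 1, so the toy MAP is (7/12 + 1) / 2.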