## CSC 575 HW#4

https://nbviewer.jupyter.org/url/condor.depaul.edu/ntomuro/courses/575/assign/575hw4.ipynb

<p><strong>Overview</strong>:</p>
<p>Implement the 'Inverted Index Retrieval Algorithm' (in
<a href="http://condor.depaul.edu/ntomuro/courses/575/notes/VS-Retrieval.pptx">
lecture note (#6)</a>) and the evaluation metric Mean Average Precision (MAP) 
(in <a href="http://condor.depaul.edu/ntomuro/courses/575/notes/Evaluation.pptx">
lecture note (#7)</a>), and apply them to a corpus called the <a href="http://ir.dcs.gla.ac.uk/resources/test_collections/">Medline 
collection</a>.
</p>
<p>The Medline 
collection is one of the standard Information Retrieval (IR) test 
collections, which many researchers have used as a benchmark to evaluate IR 
systems. It contains 1033 documents (abstracts of papers published on 
Medline), 30 queries, and relevance judgments for all query-document pairs.
</p>

### Programming: Vector-space Retrieval & Evaluation -- Partially filled code

### (1) Step 1: Load Inverted Index (H) and compute DocLen (DL).

In [4]:
import csv
import math

tindexfile = 'medline_term_index.csv'
invindexfile = 'medline_inverted_index.csv'
dindexfile = 'medline_doc_index.csv'

# Number of documents in the corpus (hard-coded for this corpus)
N = 1033

# Major data structures
H_invindex = {} # inverted index; term -> (idf, L:hashmap of (docID . tf))
DL_doclen = {}  # document lengths; docID -> len

## (1) Read the term index file and populate the invindex first
tid2term_map = {} # temporary storage to hold mappings of termID -> term

with open(tindexfile, 'r', encoding='utf-8') as fin:
    reader = csv.reader(fin, delimiter='\t')
    for line in reader:
        term = line[0]    # term string
        termID = line[1]  # termID
        df = int(line[2]) # document frequency
        idf = math.log10(N/df) # idf
        # record term -> (idf, emptyL) in H
        H_invindex[term] = (idf, dict())
        # record termID -> term
        tid2term_map[termID] = term

## (2) Read the inverted index file and add postings lists in H.
## Also accumulate the squared document lengths incrementally in DL.
with open(invindexfile, 'r', encoding='utf-8') as fin:
    reader = csv.reader(fin, delimiter='\t')
    for line in reader:
        termID = line[0]
        idf, L = H_invindex[tid2term_map[termID]]  # the term's idf and postings map
        # The rest of the line holds (docID, tf) pairs.
        idx = 1
        while idx < (len(line)-1):
            docID = line[idx]
            tf = int(line[idx+1]) # raw tf of the term in this document
            L[docID] = tf  # docID -> raw term frequency

            # Accumulate the squared tf*idf component of the document's vector length
            tfidfsq = math.pow(tf * idf, 2.0)
            DL_doclen[docID] = DL_doclen.get(docID, 0.0) + tfidfsq
            idx += 2

# Fix the DL entries by applying sqrt to make vector length.
for docID in DL_doclen.keys():
    val = DL_doclen[docID]
    DL_doclen[docID] = math.sqrt(val)

    
print ('Total # terms: %d' % len(H_invindex))
for term in ['pentobarbit', 'defici', 'treatment']:
    print (' - Entry for \'%s\': df=%s, idf=%s' % (term, len(H_invindex[term][1]), H_invindex[term][0]))

print ('\nTotal # documents: %d' % len(DL_doclen))
for docID in ['59', '1033']:
    print (' - Vector len for Doc %s = %s' % (docID, DL_doclen[docID]))


Total # terms: 11463
 - Entry for 'pentobarbit': df=4, idf=2.412040330191658
 - Entry for 'defici': df=39, idf=1.4230357144931214
 - Entry for 'treatment': df=172, idf=0.7785718746120717

Total # documents: 1033
 - Vector len for Doc 59 = 13.811725366348801
 - Vector len for Doc 1033 = 31.163653356034512
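
As a quick sanity check on the numbers printed above, the sketch below recomputes idf = log10(N/df) for the three sample terms and shows how a document's vector length is the square root of the summed squared tf*idf weights. The helper names `idf` and `vector_length`, and the toy term counts, are illustrative assumptions and not part of the assignment code.

```python
import math

N = 1033  # corpus size, as hard-coded in the notebook

def idf(df, n_docs=N):
    """Inverse document frequency with a base-10 log, as used in Step 1."""
    return math.log10(n_docs / df)

def vector_length(tf_by_term, df_by_term):
    """Euclidean length of a document's tf*idf vector."""
    return math.sqrt(sum((tf * idf(df_by_term[t])) ** 2
                         for t, tf in tf_by_term.items()))

# df values taken from the printed output above
sample_df = {'pentobarbit': 4, 'defici': 39, 'treatment': 172}
for term, df in sample_df.items():
    print(term, idf(df))

# Hypothetical document containing two of the terms (made-up tf values)
toy_tf = {'pentobarbit': 1, 'treatment': 2}
print(vector_length(toy_tf, sample_df))
```

The idf values it prints should agree with the `idf=` figures in the cell output; the toy vector length is only there to show the shape of the computation.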


### (2) Step 2: Queries as Vectors

In [5]:
import re
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

queryfile = 'medline.query'

# A list of queries. Each Query is a tuple, (qID, Q:term->tf map)
Queries_list = []

porter = nltk.PorterStemmer()
stop_words = set(stopwords.words('english'))  # build once; set membership tests are fast

with open(queryfile, 'r', encoding='utf-8') as fin:
    for line in fin:
        matchObj = re.match(r'^(\d+)\s+(.*)', line)
        if not matchObj:
            print ("ERROR with line -- %s" % line)
            continue
        queryID = matchObj.group(1) # queryID
        text = matchObj.group(2)    # query string (ignoring sentences)

        # process text string -- same processing as the one applied to documents.
        tokens = word_tokenize(text.lower())
        terms = [porter.stem(w) for w in tokens if w not in stop_words and len(w) > 1] # (**)
        # term frequencies of the terms in this query are obtained by NLTK's FreqDist
        fdist = nltk.FreqDist(terms)
        # append the Query to the list
        Queries_list.append((queryID, dict(fdist)))

print ('Total # queries: %d' % len(Queries_list))
for idx in [1, 21]:  # 0-based list positions, i.e. queries 2 and 22
    print (' - Query %s: %s' % (Queries_list[idx][0], Queries_list[idx][1]))

Total # queries: 30
 - Query 2: {'relationship': 1, 'blood': 1, 'cerebrospin': 1, 'fluid': 1, 'oxygen': 1, 'concentr': 1, 'partial': 1, 'pressur': 1, 'method': 1, 'interest': 1, 'polarographi': 1}
 - Query 22: {'mycoplasma': 1, 'infect': 1, 'presenc': 1, 'embryo': 1, 'fetu': 1, 'newborn': 1, 'infant': 1, 'anim': 1, 'pregnanc': 1, 'gynecolog': 1, 'diseas': 1, 'relat': 1, 'chromosom': 2, 'abnorm': 1}


In [6]:
print ('Total # terms: %d' % len(H_invindex))
for term in ['pentobarbit']:
    print (' - Entry for \'%s\': df=%s, idf=%s' % (term, len(H_invindex[term][1]), H_invindex[term][0]))
    print(H_invindex[term])
    print()

print ('\nTotal # documents: %d' % len(DL_doclen))
for docID in ['59', '1033']:
    print (' - Vector len for Doc %s = %s' % (docID, DL_doclen[docID]))

Total # terms: 11463
 - Entry for 'pentobarbit': df=4, idf=2.412040330191658
(2.412040330191658, {'187': 1, '301': 1, '416': 1, '419': 1})


Total # documents: 1033
 - Vector len for Doc 59 = 13.811725366348801
 - Vector len for Doc 1033 = 31.163653356034512


## (3) Retrieval and Ranking -- Steps 3, 4, 5 for each Query

### - For each query, follow Steps 3, 4 and 5 of the Vector Space Retrieval Algorithm and obtain a ranked list of retrieved documents (sorted by the cosine measure) -- 'R' in the algorithm.
### - Then save the ranked lists in a list (in the same order as the queries) -- to be used in the next step (Evaluation).
### - (**) Also, WRITE the ranked lists to an output file.  See the homework page for details.
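
The retrieval steps described above can be sketched roughly as follows. This is a minimal illustration on a toy index with the same shape as `H_invindex` (term -> (idf, {docID: tf})) and `DL_doclen`; it assumes tf*idf weights on both the query and document side and plain cosine normalization, which may differ in detail from the algorithm in the lecture notes. The function name `rank_documents` and the toy data are hypothetical.

```python
import math

def rank_documents(query_tf, H, DL):
    """Return [(docID, cosine)] sorted by decreasing cosine similarity."""
    scores = {}      # docID -> accumulated dot product
    q_len_sq = 0.0   # squared length of the query vector
    for term, qtf in query_tf.items():
        if term not in H:
            continue  # query term never seen in the corpus
        idf, postings = H[term]
        q_weight = qtf * idf
        q_len_sq += q_weight ** 2
        for doc_id, tf in postings.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + q_weight * (tf * idf)
    q_len = math.sqrt(q_len_sq)
    if q_len == 0.0:
        return []
    ranked = [(doc_id, dot / (q_len * DL[doc_id]))
              for doc_id, dot in scores.items() if DL[doc_id] > 0]
    # Sort by score (descending); break ties by docID for a stable order
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked

# Toy index: two terms, two documents (made-up values)
H_toy = {'a': (1.0, {'d1': 2, 'd2': 1}),
         'b': (2.0, {'d2': 1})}
DL_toy = {'d1': 2.0, 'd2': math.sqrt(5.0)}

ranked = rank_documents({'a': 1, 'b': 1}, H_toy, DL_toy)
print(ranked)
```

Accumulating the dot product per document while walking each term's postings list means only documents that share at least one term with the query are ever scored.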

In [28]:
# DEBUG: peek at the first six entries of the inverted index
for i, key in enumerate(H_invindex):
    print(key+' '+str(H_invindex[key]))
    if i == 5: break



'' (1.4230357144931214, {'497': 1, '545': 1, '556': 1, '560': 1, '565': 1, '592': 2, '602': 2, '604': 4, '606': 1, '608': 6, '611': 1, '612': 2, '613': 1, '615': 2, '624': 1, '625': 1, '630': 2, '633': 2, '640': 1, '647': 1, '660': 1, '678': 1, '691': 1, '699': 1, '709': 1, '713': 1, '715': 2, '720': 2, '738': 1, '740': 3, '743': 1, '750': 1, '752': 1, '789': 1, '798': 1, '801': 1, '807': 1, '808': 3, '817': 2})
'a (2.7130703258556395, {'421': 1, '609': 1})
'achondroplast (3.0141003215196207, {'576': 1})
'adequ (3.0141003215196207, {'421': 1})
'agnos (3.0141003215196207, {'358': 2})
'air (3.0141003215196207, {'70': 1})


In [105]:
# DEBUG: for the first two queries, print each query term's idf and the
# tf*idf weight of every document in that term's postings list.
from collections import defaultdict
d = defaultdict(list)  # term -> list of {docID: tf*idf}; accumulates across queries
for qID, qterms in Queries_list[:2]:
    doc_w_terms = {}  # docID -> tf*idf weight
    print(qterms)
    for token, qtf in qterms.items():
        idf, postings = H_invindex[token]  # raises KeyError if the term is not in the index
        print("QUERY TERM: "+token+' '+str(idf)+'x'+str(qtf))
        print("DOCS: ",end="")
        print(postings)
        for docid, tf in postings.items():
            print('\t'+str(docid)+': '+str(tf)+' x '+str(idf)+' = '+str(tf*idf))
            # TODO: build list/dict of this info for all docs having this term
            d[token].append({docid: tf*idf})
            # NOTE: setdefault keeps only the first weight seen for a docID;
            # scoring will need a running sum over all query terms instead.
            doc_w_terms.setdefault(docid, tf*idf)
        for key in d.keys():
            print(key+str(d[key]))
            print()
        print()
    print()

{'crystallin': 1, 'len': 1, 'vertebr': 1, 'includ': 1, 'human': 1}
QUERY TERM: crystallin 2.169002281505364x1
DOCS: {'72': 4, '175': 1, '181': 1, '336': 1, '500': 5, '509': 1, '549': 1}
	72: 4 x 2.169002281505364 = 8.676009126021455
	175: 1 x 2.169002281505364 = 2.169002281505364
	181: 1 x 2.169002281505364 = 2.169002281505364
	336: 1 x 2.169002281505364 = 2.169002281505364
	500: 5 x 2.169002281505364 = 10.845011407526819
	509: 1 x 2.169002281505364 = 2.169002281505364
	549: 1 x 2.169002281505364 = 2.169002281505364
crystallin[{'72': 8.676009126021455}, {'175': 2.169002281505364}, {'181': 2.169002281505364}, {'336': 2.169002281505364}, {'500': 10.845011407526819}, {'509': 2.169002281505364}, {'549': 2.169002281505364}]


QUERY TERM: len 1.401316464799885x1
DOCS: {'13': 3, '14': 3, '15': 8, '72': 3, '79': 3, '138': 2, '142': 4, '164': 3, '165': 3, '166': 7, '167': 3, '168': 3, '169': 3, '170': 3, '171': 5, '172': 4, '180': 2, '181': 3, '182': 5, '183': 3, '184': 3, '185': 2, '186': 2, '

## (4) Evaluation -- Compute MAP
### - Read the relevance judgments from the file "medline.rel".
### - Compare the ranked lists with the answers, and compute the MAP score.
### - (**) Also print the MAP score (to the terminal).
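
One way to compute the MAP score is sketched below. It works on in-memory data only: the layout of "medline.rel" is not shown here, so parsing it is left out, and the dictionaries `rankings` and `judgments` are made-up stand-ins. Average precision is taken as the sum of the precision values at each relevant document's rank divided by the total number of relevant documents for the query, which is a common definition; the lecture notes may use a slightly different variant.

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: mean of precision@k over ranks k holding a relevant doc."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero
    return precision_sum / len(relevant)

def mean_average_precision(rankings, judgments):
    """MAP over all queries in `rankings` (queryID -> ranked docID list)."""
    aps = [average_precision(rankings[q], judgments.get(q, ()))
           for q in rankings]
    return sum(aps) / len(aps) if aps else 0.0

# Toy data (hypothetical): two queries with ranked results and relevant sets
rankings = {'1': ['d1', 'd2', 'd3', 'd4'], '2': ['x', 'y']}
judgments = {'1': {'d1', 'd3'}, '2': {'y'}}

print('MAP = %.4f' % mean_average_precision(rankings, judgments))
```

In the notebook, `rankings` would be built from the ranked lists saved in the previous step and `judgments` from the relevance file.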