<h1><center>Information Retrieval System</center></h1>
<h3><center>Information retrieval system based on ranked retrieval</center></h3>

Information retrieval is the activity of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing.<br>

The information retrieval system here uses tf-idf scores and cosine similarity to rank the documents most relevant to the need. The dataset is a sample in which every document is an image and the text associated with it is a set of tags. (The images are not attached here.)<br>
At query time, the query is compared with the tags of every document using this scheme, and the system returns the ranked indices of the 10 highest-scoring images.<br>

The notebook contains the step-by-step code for this. The repo contains a compiled script and a short Flask app that turns the system into a locally running server that answers queries.

In [1]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
from os import path

In [2]:
df = pd.read_csv('/home/hp/Github/Information-Retrieval-System/TagsDatabase.csv',header=None)
df.columns = ['docID','tags']
df.head(5)

Unnamed: 0,docID,tags
0,0,"Hiker, demon, creepy, scary, tunnel, stalk"
1,1,"Batman, batman beyond, who are you, narrows it..."
2,2,"Up, carl, russell, honor, award, scout badge, ..."
3,3,"Tom, jerry, sword, stab, dont care, cartoon, show"
4,4,"Wholesome, comic, dialogue bubble, dog, sleepi..."


The *tags* column is cleaned by the following steps:
* Remove punctuation
* Lower case
* Strip whitespace
* Remove stopwords

The docID column is changed so that each document is identified by 'D' followed by its number.

In [3]:
df.docID = pd.Series(["D"+str(ind) for ind in df.docID])
df.head(5)

Unnamed: 0,docID,tags
0,D0,"Hiker, demon, creepy, scary, tunnel, stalk"
1,D1,"Batman, batman beyond, who are you, narrows it..."
2,D2,"Up, carl, russell, honor, award, scout badge, ..."
3,D3,"Tom, jerry, sword, stab, dont care, cartoon, show"
4,D4,"Wholesome, comic, dialogue bubble, dog, sleepi..."


In [4]:
df.tags = df.tags.str.replace(","," ")
df.tags = df.tags.str.replace(r'\W',' ', regex=True)  # regex=True is needed in pandas >= 2.0, where the default is a literal match
df.tags = df.tags.str.strip().str.lower()
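
To make the effect of these three calls concrete, the sketch below applies the same steps to a single tag string using only the standard-library `re` module. The helper name `clean_tags` is an illustrative assumption and is not part of the notebook.

```python
import re

def clean_tags(text):
    """Apply the cleaning steps from the cell above to one string."""
    text = text.replace(",", " ")      # commas become spaces
    text = re.sub(r"\W", " ", text)    # every other non-word character becomes a space
    return text.strip().lower()        # trim both ends and lower-case

cleaned = clean_tags("Hiker, demon, creepy, scary, tunnel, stalk")
# Replaced characters leave runs of spaces, so split() is used to view the tokens
print(cleaned.split())
```

The cleaned string still contains repeated spaces between words, which is harmless here because later steps tokenize the text.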

The vocabulary (all unique words in the documents) is collected as shown below. This vocabulary is later matched against the query terms.

In [5]:
all_text = " ".join(df.tags.values)
vocab = np.unique(word_tokenize(all_text))
stop_words = set(stopwords.words('english'))  # load the stopword list once instead of once per word
vocab = [word for word in vocab if word not in stop_words]
vocab

['2',
 '25',
 '4',
 'alone',
 'animated',
 'anime',
 'announce',
 'answer',
 'arm',
 'award',
 'badge',
 'batman',
 'battle',
 'bean',
 'bear',
 'bench',
 'better',
 'beyond',
 'big',
 'bigger',
 'bismol',
 'biting',
 'black',
 'blame',
 'blessing',
 'block',
 'body',
 'bounds',
 'brain',
 'broom',
 'bubble',
 'bustling',
 'callmecarson',
 'cardboard',
 'care',
 'carl',
 'carrying',
 'cartoon',
 'cast',
 'cat',
 'caught',
 'change',
 'charlie',
 'chef',
 'city',
 'classic',
 'cody',
 'college',
 'combine',
 'comic',
 'comparison',
 'contradict',
 'contradictory',
 'creative',
 'creepy',
 'crossover',
 'crusade',
 'crying',
 'cutout',
 'dancing',
 'dark',
 'dead',
 'demon',
 'despicable',
 'destroy',
 'dialogue',
 'dimmadome',
 'dislike',
 'diver',
 'dog',
 'dont',
 'doug',
 'dragonballz',
 'draw',
 'drink',
 'employee',
 'empty',
 'enough',
 'excuse',
 'face',
 'fall',
 'fallout',
 'fast',
 'feeding',
 'ferb',
 'fight',
 'fire',
 'fit',
 'food',
 'fool',
 'forbidden',
 'force',
 'fortr

A term-document matrix maps every word in the vocabulary to every document. Each document becomes a vector whose entries are the frequencies of the vocabulary words in that document.

In [6]:
def term_document_matrix(data, vocab= None, document_index= 'ID', text= 'text'):
    """Calculate frequency of term in the document.
    
    parameter: 
        data: DataFrame. 
        Frequency of word calculated against the data.
        
        vocab: list of strings.
        Vocabulary of the documents    
        
        document_index: str.
        Column name for document index in DataFrame passed.
        
        text: str
        Column name containing text for all documents in DataFrame,
        
    returns:
        vocab_index: DataFrame.
        DataFrame containing term document matrix.
        """
    
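    # Build the zero-filled matrix from the DataFrame passed in (data), not the global df
    vocab_index = pd.DataFrame(0, index= vocab, columns= data[document_index])
    
    for word in vocab_index.index:
        
        for doc in data[document_index]:
            
            # str.count counts substring matches, so '2' is also counted inside '25'
            freq = data[data[document_index] == doc][text].values[0].count(word)
            vocab_index.loc[word,doc] = freq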
    
    return vocab_index

In [7]:
similarity_index = term_document_matrix(df,vocab,'docID','tags')
similarity_index

docID,D0,D1,D2,D3,D4,D5,D6,D7,D8,D9,...,D44,D45,D46,D47,D48,D49,D50,D51,D52,D53
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
25,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
alone,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
animated,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
woman,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
women,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
wwe,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
yennefer,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


From the term-document matrix, the inverse document frequency (idf) is calculated for every term. Using the term frequencies (tf) and the idf, the tf-idf score for every word in every document is computed below.

In [8]:
def tf_idf_score(vocab_index, document_index, inv_df= 'inverse_document_frequency'):
    """
    Calculate tf-idf score for vocabulary in documents
    
    parameter:
        vocab_index: DataFrame.
        Term document matrix.
        
        document_index: list or tuple.
        Series containing document ids.
        
        inv_df: str.
        Name of the column with calculated inverse document frequencies.
        
    returns:
        vocab_index: DataFrame.
        DataFrame containing term document matrix and document frequencies, inverse document frequencies and tf-idf scores
    """
    total_docx = len(document_index)
    # Summing counts equals the document frequency only if a term occurs at most once per document
    vocab_index['document_frequency'] = vocab_index.sum(axis= 1)
    vocab_index[inv_df] = np.log2( total_docx / vocab_index['document_frequency'])
    
    for word in vocab_index.index:
        
        for doc in document_index:
            
                # Note: the idf column is already a logarithm, so log2 is applied to it a second time here
                tf_idf = np.log2(1 + vocab_index.loc[word,doc]) * np.log2(vocab_index.loc[word][inv_df])
                vocab_index.loc[word,'tf_idf_'+doc] = tf_idf
    
    return vocab_index

In [9]:
similarity_index = tf_idf_score(similarity_index, df.docID.values)
similarity_index

docID,D0,D1,D2,D3,D4,D5,D6,D7,D8,D9,...,tf_idf_D44,tf_idf_D45,tf_idf_D46,tf_idf_D47,tf_idf_D48,tf_idf_D49,tf_idf_D50,tf_idf_D51,tf_idf_D52,tf_idf_D53
2,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
25,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
4,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
alone,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
animated,0,1,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
woman,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.524788,0.000000
women,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,2.524788
wwe,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
yennefer,0,0,0,0,0,0,0,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
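
As a hand check, the value 2.524788 shown for `woman` in the `tf_idf_D52` column can be reproduced from the formula in `tf_idf_score`. The sketch assumes that the term occurs once, in exactly one of the 54 documents (D0 to D53). The function applies `log2` to an idf that is already a logarithm, so the more common weighting `log2(1 + tf) * log2(N / df)` is shown alongside for comparison. Both helper names are illustrative.

```python
import math

def notebook_weight(tf, df, n_docs):
    """Weight as computed in tf_idf_score: log2(1 + tf) * log2(idf)."""
    idf = math.log2(n_docs / df)
    return math.log2(1 + tf) * math.log2(idf)

def textbook_weight(tf, df, n_docs):
    """Common log-scaled tf-idf: log2(1 + tf) * log2(N / df)."""
    return math.log2(1 + tf) * math.log2(n_docs / df)

print(round(notebook_weight(tf=1, df=1, n_docs=54), 6))   # 2.524788
print(round(textbook_weight(tf=1, df=1, n_docs=54), 6))   # 5.754888
```

One consequence of the double logarithm is that a term with an idf below 1 gets a negative weight, and a term that appears in every document (idf of 0) has no finite weight.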


The calculation above is computationally intensive and only needs to be repeated when the database of documents changes. The dataframe containing the term frequencies, idf values and tf-idf scores is therefore saved.

In [10]:
similarity_index.to_csv('term_doc_matrix.csv')

In [11]:
test= pd.read_csv('term_doc_matrix.csv')
test = test.set_index('Unnamed: 0')


It is safe to assume that users will not enter queries in a fixed format. The query is therefore cleaned in the same way as the tags:
* Remove punctuation
* Lower case
* Strip whitespace
* Remove stop words


In [13]:
def query_processing(query):
    """
    Pre-process the query so that tf-idf scores can be calculated for it
    
    parameter:
        query: str.
        Textual query input to the system.
        
    returns:
        query: str.
        Cleaned string.
        """
    query= re.sub(r'\W',' ',query)  # raw string avoids an invalid-escape warning
    query= query.strip().lower()
    query= " ".join([word for word in query.split() if word not in stopwords.words('english')])
    
    return query

In [14]:
query = "25$ batman' is Woman"
query_processing(query)

'25 batman woman'

For every term in the query that exists in the vocabulary, a tf-idf score is calculated and stored in a new `query_tf_idf` column of the matrix.

In [15]:
def query_score(vocab_index, query):
    """
    Calculate tf-idf score for query terms
    
    parameter:
        vocab_index: DataFrame.
        Term document matrix with inverse document frequency and term frequencies calculated.
        
        query: str.
        Query submitted to the system
        
    returns:
        vocab_index: DataFrame.
        Term document matrix with tf-idf scores for terms per document and query terms.
    """
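    terms = query.split()
    # Start every term at zero so the column exists even when no query term is in the vocabulary,
    # and so scores from an earlier query are not carried over
    vocab_index['query_tf_idf'] = 0.0
    
    for word in np.unique(terms):
        
        # Count whole tokens; str.count would also match substrings ('2' inside '25')
        freq = terms.count(word)
        
        if word in vocab_index.index:
            
            # Same weighting as tf_idf_score so query and document scores are comparable
            tf_idf = np.log2(1+freq) * np.log2(vocab_index.loc[word].inverse_document_frequency)
            vocab_index.loc[word,"query_tf_idf"] = tf_idf
    
    return vocab_index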

In [16]:
query= "25 batman alone woman"
similarity_index = query_score(test,query)
similarity_index

Unnamed: 0,D0,D1,D2,D3,D4,D5,D6,D7,D8,D9,...,tf_idf_D45,tf_idf_D46,tf_idf_D47,tf_idf_D48,tf_idf_D49,tf_idf_D50,tf_idf_D51,tf_idf_D52,tf_idf_D53,query_tf_idf
2,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000
25,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,2.524788
4,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000
alone,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,2.524788
animated,0,1,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
woman,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.524788,0.000000,2.524788
women,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,2.524788,0.000000
wwe,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000
yennefer,0,0,0,0,0,0,0,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000


The last step is to find the cosine similarity between the query and each document. Cosine similarity measures how closely two vectors, here a document vector and the query vector, point in the same direction.

In [17]:
def cosine_similarity(vocab_index, document_index, query_scores):
    """
    Calculates cosine similarity between the documents and query
    
    parameter:
        
        vocab_index: DataFrame.
        DataFrame containing tf-idf score per term for every document and for the query terms.
        
        document_index: list.
        List of document ids.
        
        query_scores: str.
        Column name in DataFrame containing query term tf-idf scores.
        
    returns:
        cosine_scores: Series.
        Cosine similarity scores of every document.
    """
    cosine_scores = {}
    
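    query_scalar = np.sqrt(sum(vocab_index[query_scores] ** 2))
    
    for doc in document_index:
        
        # Note: vocab_index[doc] holds raw term counts; the tf-idf weights are in 'tf_idf_' + doc
        doc_scalar = np.sqrt(sum(vocab_index[doc] ** 2))
        dot_prod = sum(vocab_index[doc] * vocab_index[query_scores])
        # Guard against a zero-length query or document vector
        denom = query_scalar * doc_scalar
        cosine = dot_prod / denom if denom else 0.0
        
        cosine_scores[doc] = cosine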
        
    return pd.Series(cosine_scores)

In [18]:
cosines = cosine_similarity(similarity_index, df.docID.values, 'query_tf_idf')
cosines

D0     0.000000
D1     0.264088
D2     0.000000
D3     0.000000
D4     0.000000
D5     0.000000
D6     0.000000
D7     0.000000
D8     0.000000
D9     0.000000
D10    0.000000
D11    0.000000
D12    0.000000
D13    0.000000
D14    0.000000
D15    0.000000
D16    0.000000
D17    0.000000
D18    0.000000
D19    0.000000
D20    0.000000
D21    0.000000
D22    0.000000
D23    0.000000
D24    0.000000
D25    0.000000
D26    0.000000
D27    0.000000
D28    0.000000
D29    0.000000
D30    0.000000
D31    0.000000
D32    0.000000
D33    0.000000
D34    0.000000
D35    0.000000
D36    0.209599
D37    0.229604
D38    0.000000
D39    0.000000
D40    0.000000
D41    0.000000
D42    0.000000
D43    0.000000
D44    0.000000
D45    0.000000
D46    0.000000
D47    0.000000
D48    0.000000
D49    0.000000
D50    0.000000
D51    0.000000
D52    0.229604
D53    0.000000
dtype: float64
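
The cosine formula itself can be checked on small hand-made vectors. The sketch below is a plain-Python version, assuming both vectors list their weights in the same term order. The helper `cosine` and the sample vectors are illustrative only. It returns 0.0 when either vector has zero length, since the cosine is undefined in that case.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0          # undefined for a zero vector; treated as "no similarity"
    return dot / (norm_u * norm_v)

query = [1.0, 1.0, 0.0]
print(cosine(query, [2.0, 2.0, 0.0]))  # same direction: very close to 1.0
print(cosine(query, [0.0, 0.0, 3.0]))  # no shared terms: 0.0
print(cosine(query, [0.0, 0.0, 0.0]))  # empty document vector: 0.0
```

Because the score depends only on direction, a document that repeats the query terms many times does not score higher than one that contains each of them once.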

Once the cosine score between the query and every document has been calculated, the documents are ranked by score. The top 'k' documents, here 10, are retrieved in the form of indices.

In [19]:
def retrieve_index(data,cosine_scores, document_index):
    """
    Retrieves indices for the corresponding document cosine scores
    
    parameters:
        data: DataFrame.
        DataFrame containing document ids and text.
        
        cosine_scores: Series.
        Series containing document cosine scores.
        
        document_index: str.
        Column name containing document ids in data.
        
    returns:
        indices: Index.
        Row indices of the 10 documents with the highest cosine scores, best first.
    """
    
    data = data.set_index(document_index)
    data['scores'] = cosine_scores
    
    return data.reset_index().sort_values('scores',ascending=False).head(10).index


In [20]:
indices = retrieve_index(df, cosines, 'docID')
indices

Int64Index([1, 52, 37, 36, 0, 39, 29, 30, 31, 32], dtype='int64')
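
Because `head(10)` always returns ten rows, the result above also contains documents whose score is zero (indices 0, 39, 29 and so on). One possible refinement, sketched here in plain Python with a hypothetical helper `top_k`, keeps only documents with a positive score. The scores are copied from the cosine output above. Ties are kept in input order, which may differ from the pandas ordering.

```python
def top_k(scores, k=10):
    """Return up to k document ids with a positive score, best first."""
    positive = [(doc, s) for doc, s in scores.items() if s > 0]
    # sorted() is stable, so documents with equal scores keep their input order
    ranked = sorted(positive, key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]

scores = {"D0": 0.0, "D1": 0.264088, "D36": 0.209599,
          "D37": 0.229604, "D52": 0.229604, "D53": 0.0}
print(top_k(scores))       # ['D1', 'D37', 'D52', 'D36']
print(top_k(scores, k=2))  # ['D1', 'D37']
```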

These indices identify the 10 images most relevant to the query.<br>
The function below ties everything together by calling the individual functions written above. Only this function needs to be called to run the system and retrieve indices for a query.

In [21]:
def information_system(query):
    """
    Perform a retrieval from the indexes based on the query 
    and return the document ids that are similar to the query
    
    parameters:
        query: str.
        Query submitted to the system.
        
    returns:
        indices: list.
        List of document indices which are most relevant to the query.
    """
    
    df = pd.read_csv('/home/hp/Github/Information-Retrieval-System/TagsDatabase.csv',header=None)

    df.columns = ['docID','tags']
    df.docID = pd.Series(["D"+str(ind) for ind in df.docID])

    df.tags = df.tags.str.replace(","," ")
    df.tags = df.tags.str.replace(r'\W',' ', regex=True)  # regex=True is needed in pandas >= 2.0
    df.tags = df.tags.str.strip().str.lower()
    
    if not path.exists('term_doc_matrix.csv'):    

        all_text = " ".join(df.tags.values)
        vocab = np.unique(word_tokenize(all_text))
        vocab = [word for word in vocab if word not in stopwords.words('english')]

        similarity_index = term_document_matrix(df,vocab,'docID','tags')
        similarity_index = tf_idf_score(similarity_index, df.docID.values)
        
    else:
        similarity_index = pd.read_csv('term_doc_matrix.csv')
        similarity_index = similarity_index.set_index('Unnamed: 0')
        
    query = query_processing(query)
    similarity_index = query_score(similarity_index,query)
    
    cosines = cosine_similarity(similarity_index, df.docID.values, 'query_tf_idf')
    indices = retrieve_index(df, cosines, 'docID')
    
    return list(indices)

In [22]:
information_system('25 batman tom')

[1, 36, 3, 16, 40, 30, 31, 32, 33, 34]

An information retrieval system based on tf-idf scores and cosine similarity is useful for answering an information need. Its drawback is that it does not consider the context (topic) of the query, so results that are relevant with respect to the terms (words) can sometimes be off topic.<br>

The script for the code is included in the repo. A small, basic Flask server is also provided so that the information system can be used remotely via GET requests.