<a href="https://colab.research.google.com/github/zakaria-aabbou/NLP_based_information_retrieval_system/blob/main/Final_project_information_retrieval_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 style="text-align:center;font-size: 3em"> Project </h1>

<p style="text-align:left;font-size: 1.3em">
Information retrieval methods find the documents most relevant to
a given query. The words contained in web pages can be modeled using different
approaches such as Boolean models, vector space models, and probabilistic models.
In this project, we use a vector space model, specifically the
Doc2Vec (or word2vec) technique.
 </p>
<p style="text-align:left;font-size: 1.3em">
This project aims to develop an information retrieval system based on the word
embedding technique “Doc2Vec (or word2vec)”. The documents and the query are
represented by embedding vectors. The similarity between the query vector and each
document vector is computed using the cosine similarity measure. To measure
the effectiveness of this information retrieval system, the TREC test collection
(dataset) can be used; it is available on this website:
 </p>
 <a href = 'https://trec.nist.gov/data.html'> https://trec.nist.gov/data.html </a>

Tasks:
- Information retrieval system
- Query ==> most relevant documents
- Vector Space Models ==> Word Embedding technique *Doc2Vec* or *word2vec*
- Documents + Query = Vectors
- Cosine_similarity(Query , each document )
- use the TREC test collection (dataset) to measure the effectiveness of this information retrieval system
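
The cosine similarity step in the task list can be sketched in a few lines of plain Python. This is only an illustration: the function name `cosine_sim` and the 3-dimensional vectors are made up for the example, and the notebook itself relies on scikit-learn for this computation later on.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, for illustration only
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {"doc_a": [2.0, 0.0, 2.0], "doc_b": [0.0, 3.0, 0.0]}

# One similarity score per document; higher means closer to the query
scores = {doc_id: cosine_sim(query_vec, vec) for doc_id, vec in doc_vecs.items()}
print(scores)
```

Here `doc_a` points in the same direction as the query, so its score is (up to rounding) 1, while `doc_b` is orthogonal to the query and scores 0. Vector length does not matter, only direction.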

***

In this project, we will use the document ranking dataset from the **TREC 2019 Deep Learning Track**. The dataset contains 367k queries and a corpus of 3.2 million documents.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Import data

In [2]:
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None  # default='warn'

#### Import queries

In [4]:
path = '/content/drive/MyDrive/Colab Notebooks/Data/web_mining_project/'

In [5]:
queries = pd.read_csv(path + 'queries.csv')
print('Shape of the queries :',queries.shape)
queries.head()

Shape of the queries : (2000, 2)


Unnamed: 0,qid,query
0,687888,what is a jpe
1,480210,price for asphalt driveway
2,591004,what causes pressure skin bruising
3,260536,how long drive from flagstaff to grand canyon
4,39422,average number of bowel movements per day for ...


#### Creating Training Set of Queries

In [6]:
training_queries=queries.iloc[:1000]
print('Shape of the Training Set of Queries :',training_queries.shape)
training_queries.head()

Shape of the Training Set of Queries : (1000, 2)


Unnamed: 0,qid,query
0,687888,what is a jpe
1,480210,price for asphalt driveway
2,591004,what causes pressure skin bruising
3,260536,how long drive from flagstaff to grand canyon
4,39422,average number of bowel movements per day for ...


#### Creating Testing Set of Queries

In [7]:
testing_queries=queries.iloc[1000:]
print('Shape of Testing Set of Queries',testing_queries.shape)
testing_queries.head()

Shape of Testing Set of Queries (1000, 2)


Unnamed: 0,qid,query
1000,807599,what is the axis mundi?
1001,990945,where is pratt kansas
1002,48210,backordered definition
1003,894254,what show did simon baker play in
1004,165579,does dna replication in mitosis and meiosis


#### Load the training data containing the top 100 documents for each query

In [8]:
train_top100 = pd.read_csv(path + 'training_ranked100.csv')

In [9]:
# Reducing train_top100 for training
training_ranked100=train_top100[train_top100['qid'].isin(training_queries['qid'].unique())].reset_index(drop=True)
print('Shape of training_ranked100: ',training_ranked100.shape)
training_ranked100.head()

Shape of training_ranked100:  (100000, 6)


Unnamed: 0,qid,Q0,docid,rank,score,runstring
0,310290,Q0,D579750,1,-5.11498,IndriQueryLikelihood
1,310290,Q0,D579754,2,-5.57703,IndriQueryLikelihood
2,310290,Q0,D2380815,3,-5.84852,IndriQueryLikelihood
3,310290,Q0,D822566,4,-5.95002,IndriQueryLikelihood
4,310290,Q0,D2249695,5,-6.08326,IndriQueryLikelihood


#### Load the testing data containing the top 100 documents for each query

In [10]:
test_top100 = pd.read_csv(path + 'testing_ranked100.csv')

In [11]:
# Reducing test_top100 for testing
testing_ranked100=test_top100[test_top100['qid'].isin(testing_queries['qid'].unique())].reset_index(drop=True)
print('Shape of testing_ranked100 : ',testing_ranked100.shape)
testing_ranked100.head()

Shape of testing_ranked100 :  (100000, 6)


Unnamed: 0,qid,Q0,docid,rank,score,runstring
0,1164761,Q0,D3261512,1,-4.9232,IndriQueryLikelihood
1,1164761,Q0,D1529569,2,-5.01292,IndriQueryLikelihood
2,1164761,Q0,D3444265,3,-5.03616,IndriQueryLikelihood
3,1164761,Q0,D1313045,4,-5.09482,IndriQueryLikelihood
4,1164761,Q0,D1058999,5,-5.18936,IndriQueryLikelihood


#### Labelling Top 10 documents as 1 and last 10 as 0

We will label the documents at ranks 1 to 10 as relevant (1) and those at ranks 91 to 100 as non-relevant (0). The ranks come from the IndriQueryLikelihood run shown above, so these are labels derived from that ranking. Doing this benefits us in two ways: it reduces the size of the dataset, and it gives us a ground truth against which we’ll evaluate our method later.

In [12]:
rel=list(range(1,11))
nonrel=list(range(91,101))
training_ranked100['rel']=training_ranked100['rank'].apply(lambda x: 1 if x in rel else ( 0 if x in nonrel else np.nan))
testing_ranked100['rel']=testing_ranked100['rank'].apply(lambda x: 1 if x in rel else ( 0 if x in nonrel else np.nan))
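
The labelling rule can be checked in isolation with a small, dependency-free sketch. The helper `label_from_rank` and its parameter names are hypothetical and are not part of the notebook; `None` plays the role of `np.nan` here.

```python
def label_from_rank(rank, top=10, bottom_start=91, bottom_end=100):
    """Return 1 for ranks 1..top, 0 for ranks bottom_start..bottom_end, else None."""
    if 1 <= rank <= top:
        return 1
    if bottom_start <= rank <= bottom_end:
        return 0
    return None

# Label the 100 ranked documents of a single query
labels = {rank: label_from_rank(rank) for rank in range(1, 101)}

# Dropping the unlabelled ranks mirrors the dropna() step that follows
kept = {rank: rel for rank, rel in labels.items() if rel is not None}
print(len(kept), sum(kept.values()))
```

Each query keeps 20 of its 100 documents (10 relevant, 10 non-relevant), so 1000 queries yield the 20000 rows reported for the result sets below.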

In [13]:
# Result set for Training
training_result=training_ranked100.dropna()
training_result['rel']=training_result['rel'].astype(int)
print('Shape=>',training_result.shape)
training_result.head()

Shape=> (20000, 7)


Unnamed: 0,qid,Q0,docid,rank,score,runstring,rel
0,310290,Q0,D579750,1,-5.11498,IndriQueryLikelihood,1
1,310290,Q0,D579754,2,-5.57703,IndriQueryLikelihood,1
2,310290,Q0,D2380815,3,-5.84852,IndriQueryLikelihood,1
3,310290,Q0,D822566,4,-5.95002,IndriQueryLikelihood,1
4,310290,Q0,D2249695,5,-6.08326,IndriQueryLikelihood,1


In [14]:
# Result set for Testing
testing_result=testing_ranked100.dropna()
testing_result['rel']=testing_result['rel'].astype(int)
print('Shape=>',testing_result.shape)
testing_result.head()

Shape=> (20000, 7)


Unnamed: 0,qid,Q0,docid,rank,score,runstring,rel
0,1164761,Q0,D3261512,1,-4.9232,IndriQueryLikelihood,1
1,1164761,Q0,D1529569,2,-5.01292,IndriQueryLikelihood,1
2,1164761,Q0,D3444265,3,-5.03616,IndriQueryLikelihood,1
3,1164761,Q0,D1313045,4,-5.09482,IndriQueryLikelihood,1
4,1164761,Q0,D1058999,5,-5.18936,IndriQueryLikelihood,1


#### Training corpus

In [15]:
training_corpus = pd.read_csv(path + 'training_corpus.csv')
print('Shape=>',training_corpus.shape)
training_corpus.head()

Shape=> (19505, 3)


Unnamed: 0,docid,title,body
0,D297612,"Fair Oaks, CA County Of Sacramento","Home Fair Oaks, CA County Of Sacramento Fair O..."
1,D1036761,What airport is the closest to downtown London?,Answers.com ® Wiki Answers ® Categories Travel...
2,D2025493,"York County, South Carolina Genealogy",navigation search United States South Carolina...
3,D2214523,The Natural Habitat of Wolves,"Wolves are members of the canine family, but t..."
4,D1881859,Franking Privilege Law and Legal Definition,Franking Privilege Law and Legal Definition Fr...


#### Testing corpus

In [16]:
testing_corpus = pd.read_csv(path + 'testing_corpus.csv')
print('Shape=>',testing_corpus.shape)
testing_corpus.head()

Shape=> (19570, 3)


Unnamed: 0,docid,title,body
0,D1911483,My Doctor Online The Permanente Medical Group,Abnormal Vaginal Bleeding in Midlife and Beyon...
1,D2378859,How many decibels can a human hear in?,Answers.com ® Wiki Answers ® Categories Scienc...
2,D2981241,What do you call a group of lions?,Lions Vocabulary of the English Language Word ...
3,D2337005,Peripheral Vascular Surgery-Chapter 32,165 terms alexandriamartinez19Peripheral Vascu...
4,D2078142,What are all the literary devices? List them p...,Education & Reference Homework Help What are a...


Now, we have our datasets ready for further processes.

# Data Exploration

We’ll take a sample from the corpus and look at the data we have.

In [17]:
temp_doc=training_corpus.sample(1)
print('Title=>',temp_doc.title.values)
print('Body:\n',temp_doc.body.values)

Title=> ['Why do we need ip addresses?']
Body:
 ["TCP/IP IP Addresses Computer Networking The Internet Computers Why do we need ip addresses?ad by Datadog HQ.com Datadog: cloud monitoring as a service. Track your dynamic infrastructure with Datadog's cloud-scale monitoring. Start your free trial now. Learn More at datadoghq.com10 Answers Manoj Papisetty, CCNP R&S and CCIE DC. Cisco TAC. Answered Oct 30, 2015 · Author has 51 answers and 32k answer views Two things which can identify your computer on the internet - IP address and MAC address. Comparing to real world, MAC address is just like you. It is unique across the globe and is reserved for your computer just like you are unique among so many people in the world. IP Address is like your contact address. If people write a letter to you, they need an address to post. Similarly in computer world, IP address is the address to which information is sent to, to reach your computer. Just like how addresses in real world are organized geaogr

Let’s take a look at some queries.

In [18]:
for i,v in enumerate(training_queries['query'].sample(10)):
    print(i,'=>',v)

0 => what are treatment options for autism
1 => synonym depth
2 => is massage good for a groin injury
3 => where is nicki minaj originally from
4 => which sentences describe characteristics of a sole proprietorship?
5 => what is wabi
6 => what is diazepam 10mg
7 => what is the star wars galaxy called
8 => in which role does the president serve as head of the military
9 => how to start gypsophila elegans seeds


# Text Preprocessing

The pre-processing steps we’ll be performing on documents and queries are as follows:

Documents:
- Lowercase the text
- Expand Contractions
- Clean the text
- Remove Stopwords
- Lemmatize words


Queries:
- Lowercase the text
- Expand Contractions
- Clean the text

In [19]:
import re

# Lowercasing the text
training_corpus['cleaned']=training_corpus['body'].apply(lambda x:x.lower())
testing_corpus['cleaned']=testing_corpus['body'].apply(lambda x:x.lower())

# Dictionary of english Contractions
contractions_dict = { "ain't": "are not","'s":" is","aren't": "are not","can't": "can not","can't've": "cannot have",
"'cause": "because","could've": "could have","couldn't": "could not","couldn't've": "could not have",
"didn't": "did not","doesn't": "does not","don't": "do not","hadn't": "had not","hadn't've": "had not have",
"hasn't": "has not","haven't": "have not","he'd": "he would","he'd've": "he would have","he'll": "he will",
"he'll've": "he will have","how'd": "how did","how'd'y": "how do you","how'll": "how will","i'd": "i would",
"i'd've": "i would have","i'll": "i will","i'll've": "i will have","i'm": "i am","i've": "i have",
"isn't": "is not","it'd": "it would","it'd've": "it would have","it'll": "it will","it'll've": "it will have",
"let's": "let us","ma'am": "madam","mayn't": "may not","might've": "might have","mightn't": "might not",
"mightn't've": "might not have","must've": "must have","mustn't": "must not","mustn't've": "must not have",
"needn't": "need not","needn't've": "need not have","o'clock": "of the clock","oughtn't": "ought not",
"oughtn't've": "ought not have","shan't": "shall not","sha'n't": "shall not",
"shan't've": "shall not have","she'd": "she would","she'd've": "she would have","she'll": "she will",
"she'll've": "she will have","should've": "should have","shouldn't": "should not",
"shouldn't've": "should not have","so've": "so have","that'd": "that would","that'd've": "that would have",
"there'd": "there would","there'd've": "there would have",
"they'd": "they would","they'd've": "they would have","they'll": "they will","they'll've": "they will have",
"they're": "they are","they've": "they have","to've": "to have","wasn't": "was not","we'd": "we would",
"we'd've": "we would have","we'll": "we will","we'll've": "we will have","we're": "we are","we've": "we have",
"weren't": "were not","what'll": "what will","what'll've": "what will have","what're": "what are",
"what've": "what have","when've": "when have","where'd": "where did",
"where've": "where have","who'll": "who will","who'll've": "who will have","who've": "who have",
"why've": "why have","will've": "will have","won't": "will not","won't've": "will not have",
"would've": "would have","wouldn't": "would not","wouldn't've": "would not have","y'all": "you all",
"y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
"you'd": "you would","you'd've": "you would have","you'll": "you will","you'll've": "you will have",
"you're": "you are","you've": "you have"}

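# Regular expression for finding contractions
# (longest keys first so that e.g. "can't've" is matched before its prefix "can't")
contractions_re=re.compile('(%s)' % '|'.join(re.escape(k) for k in sorted(contractions_dict, key=len, reverse=True)))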

# Function for expanding contractions
def expand_contractions(text,contractions_dict=contractions_dict):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, text)

# Expanding Contractions
training_corpus['cleaned']=training_corpus['cleaned'].apply(lambda x:expand_contractions(x))
testing_corpus['cleaned']=testing_corpus['cleaned'].apply(lambda x:expand_contractions(x))
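
One detail worth illustrating is how the order of alternatives affects a contraction pattern. Python tries the alternatives of `a|b` from left to right, so a short key listed before a longer key that starts with it will win. The sketch below uses a small, hypothetical subset of the dictionary and the made-up names `contractions`, `naive` and `longest_first`.

```python
import re

# A small, hypothetical subset of the contraction dictionary, for illustration
contractions = {"can't": "can not", "can't've": "cannot have", "i'm": "i am"}

def build_pattern(keys):
    # re.escape guards against keys that contain regex metacharacters
    return re.compile("|".join(re.escape(k) for k in keys))

def expand(text, pattern):
    return pattern.sub(lambda m: contractions[m.group(0)], text)

# Dictionary order: "can't" is tried before "can't've"
naive = build_pattern(contractions)
# Longest keys first: "can't've" is tried before its prefix "can't"
longest_first = build_pattern(sorted(contractions, key=len, reverse=True))

text = "i'm sure they can't've left"
print(expand(text, naive))          # leaves a dangling "'ve"
print(expand(text, longest_first))  # expands the full contraction
```

With the naive order the result is `i am sure they can not've left`; with the longest keys first it is `i am sure they cannot have left`.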

To clean the documents, we have created a function clean_text() that removes words containing digits, replaces newline characters with spaces, removes URLs, and replaces every character that is not a lowercase English letter with a space.

In [20]:
# Function for Cleaning Text
def clean_text(text):
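    text=re.sub(r'\w*\d\w*','', text)   # remove words containing digits
    text=re.sub(r'\n',' ',text)         # replace newlines with spaces
    text=re.sub(r"http\S+", "", text)   # remove URLs
    text=re.sub(r'[^a-z]',' ',text)     # replace anything but a-z with a space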
    return text
 
# Cleaning corpus using RegEx
training_corpus['cleaned']=training_corpus['cleaned'].apply(lambda x: clean_text(x))
testing_corpus['cleaned']=testing_corpus['cleaned'].apply(lambda x: clean_text(x))
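
To see what these steps do to a concrete string, here is a standalone sketch. `clean_text_demo` is a hypothetical copy of the cleaning steps made for illustration; it also folds in the space-collapsing step and a final `strip()`, which the notebook does not apply inside clean_text().

```python
import re

def clean_text_demo(text):
    # Standalone copy of the cleaning steps, for illustration only
    text = re.sub(r'\w*\d\w*', '', text)   # drop tokens containing digits
    text = re.sub(r'\n', ' ', text)        # newlines -> spaces
    text = re.sub(r'http\S+', '', text)    # drop URLs
    text = re.sub(r'[^a-z]', ' ', text)    # anything but a-z -> space
    # Assumption for the demo: collapse spaces and trim in the same function
    return re.sub(r' +', ' ', text).strip()

sample = "visit https://example.com now!\nipv4 uses 32 bits."
cleaned = clean_text_demo(sample)
print(cleaned)
```

The sample comes out as `visit now uses bits`: the tokens `ipv4` and `32` are dropped because they contain digits, the URL is removed, and the punctuation becomes spaces. Dropping every token that contains a digit is a lossy choice, since terms such as `ipv4` can carry meaning in a query.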

We’ll collapse each run of spaces into a single space.

In [21]:
# Removing extra spaces
training_corpus['cleaned']=training_corpus['cleaned'].apply(lambda x: re.sub(' +',' ',x))
testing_corpus['cleaned']=testing_corpus['cleaned'].apply(lambda x: re.sub(' +',' ',x))

Now, we will remove the stopwords from the documents and lemmatize them. For this, we’ll be using SpaCy.

In [22]:
# Stopwords removal & Lemmatizing tokens using SpaCy
import spacy
nlp = spacy.load('en_core_web_sm',disable=['ner','parser'])
# spaCy raises a ValueError for texts longer than nlp.max_length (1,000,000 characters by default).
# The original run of this cell ended with "ValueError: ignored", most likely for that reason,
# so the limit is raised here (assumption: no document exceeds 5,000,000 characters).
nlp.max_length=5000000

# Removing Stopwords and Lemmatizing words
def lemmatize(text):
    return ' '.join(token.lemma_ for token in nlp(text) if not token.is_stop)

training_corpus['lemmatized']=training_corpus['cleaned'].apply(lemmatize)
testing_corpus['lemmatized']=testing_corpus['cleaned'].apply(lemmatize)

We have pre-processed the documents. It’s time to pre-process the queries.

In [23]:
# Lowercasing the text
training_queries['cleaned']=training_queries['query'].apply(lambda x:x.lower())
testing_queries['cleaned']=testing_queries['query'].apply(lambda x:x.lower())

# Expanding contractions
training_queries['cleaned']=training_queries['cleaned'].apply(lambda x:expand_contractions(x))
testing_queries['cleaned']=testing_queries['cleaned'].apply(lambda x:expand_contractions(x))

# Cleaning queries using RegEx
training_queries['cleaned']=training_queries['cleaned'].apply(lambda x: clean_text(x))
testing_queries['cleaned']=testing_queries['cleaned'].apply(lambda x: clean_text(x))

# Removing extra spaces
training_queries['cleaned']=training_queries['cleaned'].apply(lambda x: re.sub(' +',' ',x))
testing_queries['cleaned']=testing_queries['cleaned'].apply(lambda x: re.sub(' +',' ',x))

# Creating Vectors

First, we’ll prepare the dataset for training the word2vec model.

In [24]:
# Combining corpus and queries for training
combined_training=pd.concat([training_corpus.rename(columns={'cleaned':'text'})['text'],\
                             training_queries.rename(columns={'cleaned':'text'})['text']])\
                             .sample(frac=1).reset_index(drop=True)

Now we’ll train our word2vec model with gensim.

In [25]:
from gensim.models import Word2Vec

# Creating data for the model training
train_data=[]
for i in combined_training:
    train_data.append(i.split())

# Training a word2vec model from the given data set
# (written for gensim 3.x; gensim 4.0+ renames `size` to `vector_size`)
w2v_model = Word2Vec(train_data, size=300, min_count=2,window=5, sg=1,workers=4)

KeyboardInterrupt: ignored

Save the model in order to use it later.

In [None]:
from gensim.models import Word2Vec
w2v_model.save("word2vec.model")
w2v_model = Word2Vec.load("word2vec.model")

In [None]:
# Vocabulary size
# (gensim 3.x API; in gensim 4.0+ use len(w2v_model.wv.key_to_index) instead)
print('Vocabulary size:', len(w2v_model.wv.vocab))

Since word2vec provides one vector per word, we’ll create a function get_embedding_w2v() that generates a single vector for a whole document or query. The function looks up the word2vec vector of each token, assigns a random vector to tokens that are not in the vocabulary, and averages the results.

In [None]:
# Function returning vector representation of a document
def get_embedding_w2v(doc_tokens):
    embeddings = []
    if len(doc_tokens)<1:
        return np.zeros(300)
    else:
        for tok in doc_tokens:
            # `tok in wv` and `wv[tok]` work in both gensim 3.x and 4.x
            if tok in w2v_model.wv:
                embeddings.append(w2v_model.wv[tok])
            else:
                # out-of-vocabulary token: random vector (results vary between runs)
                embeddings.append(np.random.rand(300))
        # mean the vectors of individual words to get the vector of the document
        return np.mean(embeddings, axis=0)

# Getting Word2Vec Vectors for Testing Corpus and Queries
testing_corpus['vector']=testing_corpus['cleaned'].apply(lambda x :get_embedding_w2v(x.split()))
testing_queries['vector']=testing_queries['cleaned'].apply(lambda x :get_embedding_w2v(x.split()))
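
The averaging idea can be seen on a toy example without a trained model. The sketch below is not the notebook's function: `mean_vector` and `toy_vectors` are hypothetical, the vectors are 2-dimensional, and, unlike get_embedding_w2v(), unknown tokens are skipped instead of receiving a random vector, which keeps the output deterministic.

```python
def mean_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th components of all vectors together
    return [sum(component) / len(known) for component in zip(*known)]

# Hypothetical 2-dimensional word vectors, for illustration only
toy_vectors = {"asphalt": [1.0, 3.0], "driveway": [3.0, 1.0]}

doc_vec = mean_vector(["asphalt", "driveway", "zzz"], toy_vectors, dim=2)
empty_vec = mean_vector(["zzz"], toy_vectors, dim=2)
print(doc_vec, empty_vec)
```

The document vector is `[2.0, 2.0]`, the component-wise mean of the two known words, and a text with no known words maps to the zero vector. Averaging discards word order, so two texts with the same words in a different order get the same vector.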

# Ranking & Evaluation

We have successfully trained our word2vec model and created vectors for documents and queries in the testing set for information retrieval. Now, it’s time to rank the documents according to the queries.

For ranking and evaluation, we have created a function average_precision(), which takes the id and the vector of a query as input, ranks the labelled documents of that query by cosine similarity, and returns the average precision over the top 10 results.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Function for calculating average precision for a query
def average_precision(qid,qvector):
  
    # Getting the ground truth and document vectors
    qresult=testing_result.loc[testing_result['qid']==qid,['docid','rel']]
    qcorpus=testing_corpus.loc[testing_corpus['docid'].isin(qresult['docid']),['docid','vector']]
    qresult=pd.merge(qresult,qcorpus,on='docid')
  
    # Ranking documents for the query
    qresult['similarity']=qresult['vector'].apply(lambda x: cosine_similarity(np.array(qvector).reshape(1, -1),np.array(x).reshape(1, -1)).item())
    qresult.sort_values(by='similarity',ascending=False,inplace=True)

    # Taking Top 10 documents for the evaluation
    ranking=qresult.head(10)['rel'].values
  
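    # Calculating precision at each rank that holds a relevant document
    # (len(ranking) can be below 10 if some documents are missing from the corpus)
    precision=[]
    for i in range(1,len(ranking)+1):
        if ranking[i-1]:
            precision.append(np.sum(ranking[:i])/i)
  
    # If no relevant document in list then return 0
    if not precision:
        return 0

    return np.mean(precision)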

# Calculating average precision for all queries in the test set
testing_queries['AP']=testing_queries.apply(lambda x: average_precision(x['qid'],x['vector']),axis=1)

# Finding Mean Average Precision
print('Mean Average Precision=>',testing_queries['AP'].mean())
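
A worked example may make the metric easier to follow. The sketch below is a plain-Python restatement of the same calculation on a list of 0/1 relevance labels that is already sorted by similarity; the name `average_precision_at_k` and the sample ranking are made up for illustration.

```python
def average_precision_at_k(relevance, k=10):
    """Mean of precision@i over the ranks i (1..k) that hold a relevant document."""
    hits = 0
    precisions = []
    for i, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)  # precision at rank i
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant documents at ranks 1, 3 and 4
ap = average_precision_at_k([1, 0, 1, 1, 0])
print(round(ap, 4))
```

For this ranking, the precisions at the relevant ranks are 1/1, 2/3 and 3/4, and their mean is about 0.8056. Note that this variant divides by the number of relevant documents found in the top k, not by the total number of relevant documents for the query.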

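MAP ranges between 0 and 1, with 0 being the worst and 1 the best. Our information retrieval model reaches a MAP of 0.807 in this evaluation. Because the relevance labels are derived from the Indri ranking rather than from human judgments, this score measures how well our ranking agrees with that baseline.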

# Final IR system

We’ll create a function rank() that takes a query as input and returns the 10 most relevant documents. The function follows the information retrieval (IR) pipeline: it pre-processes the query, generates a vector for it, and then ranks the documents by their similarity scores.

In [None]:
def rank(query):

    # pre-process Query
    query=query.lower()
    query=expand_contractions(query)
    query=clean_text(query)
    query=re.sub(' +',' ',query)

    # generating vector
    vector=get_embedding_w2v(query.split())

    # ranking documents
    documents=testing_corpus[['docid','title','body']].copy()
    documents['similarity']=testing_corpus['vector'].apply(lambda x: cosine_similarity(np.array(vector).reshape(1, -1),np.array(x).reshape(1, -1)).item())
    documents.sort_values(by='similarity',ascending=False,inplace=True)

    return documents.head(10).reset_index(drop=True)
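
The ranking step can also be sketched without pandas. This is an alternative formulation under stated assumptions, not the notebook's method: `top_k`, `cosine_sim` and the toy vectors are hypothetical, and `heapq.nlargest` is used so that only the best k scores are kept instead of sorting every document.

```python
import heapq
import math

def cosine_sim(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=10):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = ((doc_id, cosine_sim(query_vec, vec)) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical 2-dimensional document vectors, for illustration only
toy_docs = {"d1": [1.0, 0.0], "d2": [1.0, 1.0], "d3": [0.0, 1.0]}
results = top_k([1.0, 0.1], toy_docs, k=2)
print([doc_id for doc_id, _ in results])
```

The query vector lies closest to `d1`, then `d2`, so those two are returned in that order. For a corpus of this notebook's size a full sort is also fine; the heap mainly helps when k is small relative to the number of documents.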

We have now created our function. Let’s run some queries on our system.

In [None]:
rank('Lebron James')

In [None]:
rank('President Donald Trump')