<a href="https://colab.research.google.com/github/telsayed/IR-in-Arabic/blob/master/Summer2021/labs/day5/IR_in_Arabic_Lab5_LanguageModels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# **IR in Arabic** - Summer 2021 lab day5



This is one of a series of Colab notebooks created for the **IR in Arabic** course. It demonstrates how we can perform ranked retrieval using a language model and evaluate the output of multiple retrieval models.

The **learning outcomes** of this notebook are:


*   Retrieval using a language model with Jelinek-Mercer smoothing.
*   Evaluating and comparing the results of multiple retrieval models.


### **Setup**
We will first install PyTerrier as follows:

In [None]:
#install the PyTerrier framework
!pip install python-terrier

The next step is to initialise PyTerrier. This is performed using PyTerrier's init() method. The init() method is needed as PyTerrier must download Terrier's jar file and start the Java virtual machine. We prevent init() from being called more than once by checking started().

In [None]:
import pyterrier as pt
if not pt.started():
  pt.init()

Another library that we need for this lab is Arabic-Stopwords.

In [None]:
#install the Arabic stop words library
!pip install Arabic-Stopwords

We will import all the Python libraries needed for this lab.

In [None]:
#we need to import the following libraries.
import pandas as pd
#to display the full text on the notebook without truncation
pd.set_option('display.max_colwidth', 150)
import numpy as np
import re
from snowballstemmer import stemmer
from tqdm import tqdm
import arabicstopwords.arabicstopwords as stp

We will prepare helper functions for stop-word removal, normalization, and stemming, which we will use to process our queries.

In [None]:
#build the stop-word set once instead of on every call
stop_words = set(stp.stopwords_list())

#remove stop words from a sentence
def remove_stop_words(sentence):
    terms = [term for term in sentence.split() if term not in stop_words]
    return " ".join(terms)

#a function to normalize the tweets
def normalize(text):
    text = re.sub("[إأٱآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "ء", text)
    text = re.sub("ئ", "ء", text)
    text = re.sub("ة", "ه", text)
    return text

#define the stemming function
ar_stemmer = stemmer("arabic")
def stem(sentence):
    return " ".join([ar_stemmer.stemWord(i) for i in sentence.split()])
  

# perform all preprocessing steps
def preprocess(sentence):
  # apply preprocessing steps on the given sentence
  sentence =remove_stop_words(sentence)
  sentence =normalize(sentence)
  sentence =stem(sentence)
  return sentence
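
The letter normalization above can also be written as a single pass over the text. The following is only an illustrative sketch, not the notebook's method: `NORMALIZATION_TABLE` and `normalize_translate` are hypothetical names, and the mapping is copied from the `re.sub` calls in `normalize`. Code points are spelled out as escapes so each key is guaranteed to be a single character.

```python
# Single-pass variant of the letter normalization (illustrative sketch).
# NORMALIZATION_TABLE and normalize_translate are hypothetical names.
NORMALIZATION_TABLE = str.maketrans({
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0671": "\u0627",  # alef wasla -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maksura -> yeh
    "\u0624": "\u0621",  # waw with hamza -> bare hamza
    "\u0626": "\u0621",  # yeh with hamza -> bare hamza
    "\u0629": "\u0647",  # teh marbuta -> heh
})

def normalize_translate(text):
    # str.translate replaces every mapped character in one pass
    return text.translate(NORMALIZATION_TABLE)

# print a few words before and after normalization
for word in ["أحمد", "مدرسة", "مستشفى"]:
    print(word, "->", normalize_translate(word))
```

Because none of the replacement letters appears as a key in the table, the single pass should give the same result as the chained substitutions.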

We will use our indexed **EveTAR** dataset. The index is uploaded to our GitHub repository, so we will access it as follows:

In [None]:
%rm -rf IR-in-Arabic
%rm -rf evetarIndex
!git clone https://github.com/telsayed/IR-in-Arabic.git 
!unzip IR-in-Arabic/Summer2021/data/EveTAR/evetarIndex.zip -d evetarIndex
!ls evetarIndex

Next, we will load our index.

In [None]:
#we will load the index
index_ref = pt.IndexRef.of("./evetarIndex/data.properties")
index = pt.IndexFactory.of(index_ref)

Let's load our collection.

In [None]:
dataset_links=["https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-01.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-02.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-03.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-04.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-05.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-06.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-07.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-08.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-09.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-10.txt"]

full_data=pd.DataFrame()
for i in tqdm(range(len(dataset_links))):
    tweets=pd.read_csv(dataset_links[i], sep='\t')
    full_data=pd.concat([full_data,tweets],ignore_index=True)
full_data.reset_index(inplace=True,drop=True)
#the docno will be our tweetID
full_data["docno"]=full_data["tweetID"].astype(str)
full_data

We will load the queries that are already defined and released with the EveTAR dataset, and process them using the same steps we applied when we indexed EveTAR.

In [None]:
#read the queries file from GitHub
topics=pd.read_csv("https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/topics.txt", sep='\t',names=['data'])
queries=[]
qid=[]
#we will get the queries and their ids from the topics file
for line in topics["data"]:
    tokens=line.split(" ")
    #a <title> line holds the query text after the tag
    if tokens[0]=="<title>":
       queries.append(' '.join(tokens[1:]))
    #a <num> line holds the query id as its third token
    if tokens[0]=="<num>":
       qid.append(tokens[2])
queriesDF=pd.DataFrame()
#the queries dataframe should have qid and query columns to retrieve using PyTerrier
queriesDF["qid"]=qid
queriesDF["raw_query"]=queries
#remove the stopwords from queries, do normalization, and apply stemming
queriesDF["query"]=queriesDF["raw_query"].apply(preprocess)
queriesDF
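
To make the parsing step easier to follow, here is a small standalone sketch of the same logic applied to a few made-up lines. The sample lines are hypothetical: the layout of `topics.txt` is assumed from how the loop indexes the tokens (id as the third token of a `<num>` line, query text after the `<title>` tag), and the real file may contain other fields.

```python
# Hypothetical topic lines in the layout the parsing loop appears to assume.
sample_lines = [
    "<top>",
    "<num> Number: E01",
    "<title> sample query text",
    "</top>",
]

qids, queries = [], []
for line in sample_lines:
    tokens = line.split(" ")
    if tokens[0] == "<num>":
        # the query id is the third space-separated token
        qids.append(tokens[2])
    elif tokens[0] == "<title>":
        # everything after the tag is the query text
        queries.append(" ".join(tokens[1:]))

print(list(zip(qids, queries)))
```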

**Retrieval with a language model using Jelinek-Mercer smoothing**


We will use PyTerrier's BatchRetrieve class for retrieval, with the **Hiemstra LM weighting model**, which supports **Jelinek-Mercer smoothing**. You can check the weighting models supported by PyTerrier [here](http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html).

The lambda parameter for Jelinek-Mercer smoothing (the `c` control in PyTerrier) is set to 0.15 by default.

In [None]:
#set up our retrieval model by specifying Hiemstra_LM as wmodel and limiting the results for each query to the top 10 documents
JM_retr = pt.BatchRetrieve(index,wmodel="Hiemstra_LM",num_results=10)

Let's try a single query.

In [None]:
query="مباراة العراق وكوريا الجنوبية في نصف نهائي كأس آسيا"
#we need to process the query also as we did for documents
query = preprocess(query)
#we will call the search function using our retrieval model we set up above
results=JM_retr.search(query)
if len(results)==0:
   print("There are no relevant documents for your selected query.")
else:
   print(results)

In [None]:
#Let's check the tweets text for the top 5 retrieved tweets
full_data[full_data['docno'].isin(results['docno'].loc[0:4].tolist())]

Let's update **Lambda** and set it to 0.95.

* **High value of λ:** "conjunctive-like" search that tends to retrieve documents containing all query words.
* **Low value of λ:** more disjunctive, suitable for long queries.
* Correctly setting λ is very important for good performance.
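
The effect of λ can be illustrated with a toy calculation. This is a minimal sketch of textbook Jelinek-Mercer query likelihood, assuming the convention used in this lab where λ weights the document model and (1 - λ) weights the collection model. The function `jm_score`, the two toy documents, and the collection probabilities are all made up for illustration, and Terrier's `Hiemstra_LM` implementation may differ in detail.

```python
import math

def jm_score(query_terms, doc_tf, doc_len, collection_prob, lam):
    """Query log-likelihood with Jelinek-Mercer smoothing.

    lam weights the document model; (1 - lam) weights the collection model.
    """
    score = 0.0
    for term in query_terms:
        p_doc = doc_tf.get(term, 0) / doc_len   # maximum-likelihood P(term | document)
        p_col = collection_prob[term]           # P(term | collection)
        score += math.log(lam * p_doc + (1 - lam) * p_col)
    return score

query = ["iraq", "korea"]
collection_prob = {"iraq": 0.01, "korea": 0.01}  # made-up collection statistics
doc_all_terms = {"iraq": 1, "korea": 1}          # contains both query terms once
doc_one_term = {"iraq": 5}                       # contains one query term five times
doc_len = 10                                     # both toy documents have 10 tokens

for lam in (0.95, 0.01):
    a = jm_score(query, doc_all_terms, doc_len, collection_prob, lam)
    b = jm_score(query, doc_one_term, doc_len, collection_prob, lam)
    print(f"lambda={lam}: all-terms doc {a:.3f}, one-term doc {b:.3f}")
```

With these made-up numbers, the document containing both query terms scores higher at λ = 0.95, while the document that repeats a single term scores higher at λ = 0.01, which matches the conjunctive versus disjunctive behaviour described above.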


In [None]:
#set up our retrieval model with Hiemstra_LM as wmodel, lambda (c) set to 0.95, and the results for each query limited to the top 10 documents
JM_retr_highLambda = pt.BatchRetrieve(index,wmodel="Hiemstra_LM",controls ={"c":0.95},num_results=10)

In [None]:
query="مباراة العراق وكوريا الجنوبية في نصف نهائي كأس آسيا"
#we need to process the query also as we did for documents
query = preprocess(query)
#we will call the search function using our retrieval model we set up above
results=JM_retr_highLambda.search(query)
if len(results)==0:
   print("There are no relevant documents for your selected query.")
else:
   print(results)

Let's check the text of top 5 retrieved tweets.

In [None]:
full_data[full_data['docno'].isin(results['docno'].loc[0:4].tolist())]

We can also retrieve documents for a whole set of queries at once. We will use the set of queries we prepared earlier.

In [None]:
#Retrieve using Jelinek-Mercer smoothing where lambda=0.15 (default)
JM_res=JM_retr.transform(queriesDF)
JM_res

In [None]:
#Retrieve using Jelinek-Mercer smoothing where lambda=0.95
JM_highLambda_res=JM_retr_highLambda.transform(queriesDF)
JM_highLambda_res

### **Evaluating our results** 
To evaluate the results, we need to load our qrels (relevance judgements).

In [None]:
qrels=pd.read_csv("https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/qrels.txt", sep='\t',names=['qid','Q0','docno','label'])
qrels['docno']=qrels['docno'].astype(str)
qrels = qrels[qrels["docno"].isin(full_data["docno"].tolist())]
qrels

To evaluate our results we can use PyTerrier's Utils.evaluate function. This function takes the results and a qrels dataframe containing three columns: **qid, docno, label.**
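
To make the "map" metric concrete, the following is a minimal pure-Python sketch of average precision and its mean over queries, following the usual definition in which the sum of precisions at relevant ranks is divided by the total number of relevant documents. It is not PyTerrier's implementation, and the run, judgements, and the helper name `average_precision` are invented for illustration.

```python
def average_precision(ranked_docnos, relevant_docnos):
    """Average precision of one ranked list against a set of relevant docnos."""
    relevant = set(relevant_docnos)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    # divide by all relevant documents, including those never retrieved
    return precision_sum / len(relevant)

# made-up ranked results and relevance judgements for two queries
run = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
judged = {"q1": {"d1", "d3"}, "q2": {"d6", "d9"}}

ap = {q: average_precision(run[q], judged[q]) for q in run}
mean_ap = sum(ap.values()) / len(ap)
print(ap, mean_ap)
```

For q1 the relevant documents appear at ranks 1 and 3, so its average precision is (1 + 2/3) / 2; for q2 only one of the two relevant documents is retrieved, at rank 2, giving (1/2) / 2.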







In [None]:
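#use a descriptive name to avoid shadowing Python's built-in eval
eval_default = pt.Utils.evaluate(JM_res,qrels[['qid','docno','label']],metrics=["map","recall","P"])
eval_default

In [None]:
#evaluate the run with lambda=0.95
eval_highLambda = pt.Utils.evaluate(JM_highLambda_res,qrels[['qid','docno','label']],metrics=["map","recall","P"])
eval_highLambda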

### **How to compare different retrieval models using PyTerrier**

PyTerrier makes it easy to compare different retrieval models. Let's compare JM models with 3 different values of lambda.

In [None]:
JM_retr = pt.BatchRetrieve(index,wmodel="Hiemstra_LM",controls ={"c":0.15},num_results=1000)
JM_retr_highLambda = pt.BatchRetrieve(index,wmodel="Hiemstra_LM",controls ={"c":0.95},num_results=1000)
JM_retr_lowLambda = pt.BatchRetrieve(index,wmodel="Hiemstra_LM",controls ={"c":0.01},num_results=1000)
#call pt.Experiment
pt.Experiment(
[JM_retr ,JM_retr_highLambda, JM_retr_lowLambda],
queriesDF,
qrels,
eval_metrics=["map", "P"], 
names=["JM_retr","JM_retr_highLambda","JM_retr_lowLambda"]
)


Next, let's compare the models from the previous lab (BM25, TF_IDF) with today's JM model.




In [None]:
JM_retr = pt.BatchRetrieve(index,wmodel="Hiemstra_LM",controls ={"c":0.15},num_results=1000)
JM_retr_highLambda = pt.BatchRetrieve(index,wmodel="Hiemstra_LM",controls ={"c":0.95},num_results=1000)
bm25_retr = pt.BatchRetrieve(index, controls = {"wmodel": "BM25"},num_results=1000)
tfidf_retr = pt.BatchRetrieve(index, controls = {"wmodel": "TF_IDF"},num_results=1000)

pt.Experiment(
[JM_retr,JM_retr_highLambda ,bm25_retr, tfidf_retr],
queriesDF,
qrels,
eval_metrics=["map", "P"], 
names=["JM_retr","JM_retr_highLambda","bm25_retr","tfidf_retr"]
)

Other useful parameters:
1.   **dataframe(bool)**: If True, return the results as a dataframe; otherwise, return a dictionary of dictionaries. Default=True.
2.   **perquery(bool)**: If True, return each metric for each query; otherwise, return the mean of each metric across all queries. Default=False.

In [None]:
pt.Experiment(
[JM_retr,JM_retr_highLambda ,bm25_retr, tfidf_retr],
queriesDF,
qrels,
eval_metrics=["map", "P"], 
names=["JM_retr","JM_retr_highLambda","bm25_retr","tfidf_retr"],
perquery=True
)

In [None]:
pt.Experiment(
[JM_retr,JM_retr_highLambda ,bm25_retr, tfidf_retr],
queriesDF,
qrels,
eval_metrics=["map", "P"], 
names=["JM_retr","JM_retr_highLambda","bm25_retr","tfidf_retr"],
dataframe=False
)

## **Exercise 1**
Use the first 10 queries from our query set to retrieve the top 100 potentially relevant documents using both BM25 and the language model with JM smoothing, setting lambda to 0.8. Compare the results in terms of MAP only.

In [None]:
#add your solution here

## **Exercise 2**
Given the following queries:

['E14' 'E48' 'E36' 'E58' 'E19' 'E63' 'E30' 'E27' 'E39' 'E21']

1. Retrieve the top 1000 potentially relevant documents using the language model with JM smoothing, setting lambda to 0.9.
2. Retrieve the text of both the queries and the documents and combine them into one dataframe.
3. Save the resulting dataframe to a text file.


In [None]:
selected_queries = ['E14','E48', 'E36', 'E58', 'E19', 'E63', 'E30', 'E27', 'E39', 'E21']
# write your solution here

## **Exercise 3**

Explore the weighting models supported by PyTerrier [here](http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html). Choose multiple models to retrieve the relevant tweets for the full set of queries and compare the results in terms of MAP and precision.

## **References**


*   [PyTerrier retrieval and evaluation notebook](https://github.com/terrier-org/pyterrier/blob/master/examples/notebooks/retrieval_and_evaluation.ipynb).
*   [PyTerrier documentation](https://pyterrier.readthedocs.io/_/downloads/en/latest/pdf/).

