<a href="https://colab.research.google.com/github/telsayed/IR-in-Arabic/blob/master/Summer2021/labs/day2/IR_in_Arabic_Lab2_Indexing%26ExploringIndexing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# **IR in Arabic** - Summer 2021 lab notebook 2


This is one of a series of Colab notebooks created for the **IR in Arabic** course. It demonstrates how to index a collection and how to access the resulting index to explore its contents and statistics.

The **learning outcomes** of this notebook are:


*   PyTerrier setup.
*   Preprocessing.
*   Indexing a collection.
*   Accessing and exploring the index.

What is PyTerrier?

**[PyTerrier](https://pyterrier.readthedocs.io/en/latest/)** is a Python framework that relies on the underlying [Terrier information retrieval](http://terrier.org/) toolkit for many indexing and retrieval operations. While PyTerrier was new in 2020, Terrier is written in Java and has a long history dating back to 2001. PyTerrier makes it easy to run IR experiments in Python while using the mature Terrier platform for the expensive indexing and retrieval operations.


### **Setup**
We will first install PyTerrier as follows:

In [None]:
#install the PyTerrier framework
!pip install python-terrier

The next step is to initialise PyTerrier. This is performed using PyTerrier's init() method. The init() method is needed as PyTerrier must download Terrier's jar file and start the Java virtual machine. We prevent init() from being called more than once by checking started().

In [None]:
import pyterrier as pt
if not pt.started():
  pt.init()

Another library that we need for this lab is Arabic-Stopwords:

In [None]:
#install the Arabic stop words library
!pip install Arabic-Stopwords

We will now import all the Python libraries needed for this lab:

In [None]:
#we need to import the following libraries.
import pandas as pd
#to display the full text on the notebook without truncation
pd.set_option('display.max_colwidth', 150)
import re
from snowballstemmer import stemmer
import arabicstopwords.arabicstopwords as stp
#make your loops show a smart progress meter 
from tqdm import tqdm

### **What are DataFrames?** 
[Pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html): Two-dimensional, size-mutable, potentially heterogeneous tabular data. Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects.

In [None]:
#create a new dataframe
my_df=pd.DataFrame([["Ahmed",25,50000],["Fatima",35,690000],["Nada",45,460000]],columns=['name','age','salary'])
my_df

In [None]:
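#insert a new row (DataFrame.append was removed in pandas 2.0, so we use pd.concat)
new_row=pd.DataFrame([{'name':'Salwa','age':24,'salary':90000}])
my_df=pd.concat([my_df,new_row],ignore_index=True)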
my_df

In [None]:
#print just name and salary
my_df[['name','salary']]

In [None]:
#print the data about people with salary>60000
my_df[my_df['salary']>60000]

In [None]:
#increase the salary of all by 1000
def increase_salary(salary):
    return salary+1000
    
my_df["salary"]=my_df["salary"].apply(increase_salary)
my_df

### **Data preparation**
We will first create five textual documents.

In [None]:
docs_df = pd.DataFrame([ ["d0", "هذا هو اليوم الأول من دورة استرجاع المعلومات"],
                        ["d1", "الدورة باللغة العربية للطلاب العرب"], 
                        ["d2", "اليوم هو 30 مايو 2021"], 
                        ["d3", "نأمل أن تفيد هذه الدورة الطلاب العرب"],
                        ["d4", "هل أنتم سعداء بهذه التجربة؟"] ],
                        columns=["docno", "raw_text"])

docs_df

Before indexing our data we need to do the following processing steps:


1.   **Remove stopwords.**
2.   **Normalization.**
3.   **Stemming.**




Let's remove the stopwords.

In [None]:
stp.stopwords_list()

In [None]:
len(stp.stopwords_list())

In [None]:
#build the stopword set once so that each membership test is fast
stopWords = set(stp.stopwords_list())

#a function to remove the stopwords from a sentence
def remove_stopWords(sentence):
    terms = []
    for term in sentence.split():
        if term not in stopWords:
            terms.append(term)
    return " ".join(terms)

docs_df["text"]=docs_df["raw_text"].apply(remove_stopWords)
print("***************************************************************************documents after removing stopwords*********************************************************************")
docs_df

After removing the stopwords the next step is to normalize our documents.

In [None]:
#a function to normalize Arabic text
def normalize(text):
    text = re.sub("[إأٱآا]", "ا", text) #unify the alef variants
    text = re.sub("ى", "ي", text) #alef maqsura to ya
    text = re.sub("ؤ", "ء", text) #waw with hamza to hamza
    text = re.sub("ئ", "ء", text) #ya with hamza to hamza
    text = re.sub("ة", "ه", text) #ta marbuta to ha
    return text

docs_df["text"]=docs_df["text"].apply(normalize)
print("***************************************************************************documents after normalizing*********************************************************************")
docs_df  
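To see why normalization matters, the following minimal sketch applies the same character mappings to a few spelling variants and checks that they collapse into a single form. The sample words are chosen purely for illustration and are not taken from the collection.

```python
import re

def normalize(text):
    """Same character mappings as the normalize() function above."""
    text = re.sub("[إأٱآا]", "ا", text)  # unify the alef variants
    text = re.sub("ى", "ي", text)         # alef maqsura to ya
    text = re.sub("ؤ", "ء", text)         # waw with hamza to hamza
    text = re.sub("ئ", "ء", text)         # ya with hamza to hamza
    text = re.sub("ة", "ه", text)         # ta marbuta to ha
    return text

# Illustrative spelling variants of one name (not taken from the collection).
variants = ["أحمد", "احمد", "إحمد"]
normalized = {normalize(word) for word in variants}
print(normalized)

# Words written with ta marbuta or ha end up identical after normalization.
same = normalize("مدرسة") == normalize("مدرسه")
print(same)
```

Without this step, each variant would become a separate entry in the vocabulary, and a query written with one spelling would miss documents written with another.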

The last processing step is to stem the terms in each document.

In [None]:
#specify that we want to stem arabic text
ar_stemmer = stemmer("arabic")
#define the stemming function
def stem(sentence):
    return " ".join([ar_stemmer.stemWord(i) for i in sentence.split()])

docs_df['text']=docs_df['text'].apply(stem)
print("***************************************************************************documents after stemming*********************************************************************")
docs_df

Next, we will index the dataframe's documents. The index, with all its data structures, is saved into a directory called **myFirstIndex**.

In [None]:
indexer = pt.DFIndexer("./myFirstIndex", overwrite=True)
#the default tokeniser targets English, so we switch to the Unicode-aware "UTFTokeniser"
indexer.setProperty("tokeniser", "UTFTokeniser")
# index the text, record the docnos as metadata
index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index_ref.toString()

### **Explore the index**
An index has several data structures:

* **the CollectionStatistics** - the salient global statistics of the index.
* **the Lexicon** - the vocabulary of the index, including statistics of the terms, and a pointer into the inverted index.
* **the inverted index (a PostingIndex)** - contains the posting list for each term, detailing the documents in which the term appears and its frequency in each of them.
* **the DocumentIndex** - contains the length of the document (and other field lengths).
* **the MetaIndex** - contains document metadata, such as the docno, and optionally the raw text and the URL of each document.
* **the direct index (also a PostingIndex)** - contains a posting list for each document, detailing which terms occur in that document and with which frequency. The presence of the direct index depends on the IndexingType that has been applied - single-pass and some memory indices do not provide a direct index.
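The roles of these structures can be sketched with plain Python containers. This is only a toy model under simplifying assumptions: the names `meta_index`, `document_index`, `direct_index`, `inverted_index`, `lexicon` and `collection_statistics` are illustrative, the documents are hand-written tokens, and Terrier's actual on-disk structures are compressed and far more elaborate.

```python
from collections import Counter

# Hand-written toy documents (illustrative tokens, not real stemmer output).
docs = [
    ("d0", "يوم اول دوره استرجاع"),
    ("d1", "دوره لغه عرب طلاب عرب"),
    ("d2", "دوره طلاب عرب"),
]

meta_index = []      # docid -> docno
document_index = []  # docid -> document length in tokens
direct_index = []    # docid -> {term: frequency in that document}
inverted_index = {}  # term -> posting list of (docid, frequency)

for docid, (docno, text) in enumerate(docs):
    term_freqs = Counter(text.split())
    meta_index.append(docno)
    document_index.append(sum(term_freqs.values()))
    direct_index.append(dict(term_freqs))
    for term, freq in term_freqs.items():
        inverted_index.setdefault(term, []).append((docid, freq))

# Lexicon: the vocabulary, here mapping each term to a numeric term id.
lexicon = {term: termid for termid, term in enumerate(sorted(inverted_index))}

collection_statistics = {
    "documents": len(docs),
    "unique_terms": len(lexicon),
    "tokens": sum(document_index),
}

print(inverted_index["عرب"])  # documents containing the term, with frequencies
print(direct_index[1])         # terms of d1, with frequencies
print(collection_statistics)
```

The inverted index answers "which documents contain this term?", while the direct index answers the opposite question, "which terms does this document contain?".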


Let's check the files that the indexer created.

In [None]:
!ls -lh myFirstIndex/

We can download the index to our local machine as follows:

In [None]:
# from google.colab import files
# !zip -r ./myFirstIndex.zip ./myFirstIndex
# files.download("myFirstIndex.zip")

Let's check the statistics about the index we created.

In [None]:
print(index_ref.toString())
#we will first load the index
index = pt.IndexFactory.of(index_ref)
#we will call getCollectionStatistics() to check the stats
print(index.getCollectionStatistics().toString())

We can check the lexicon which is the **vocabulary** of the collection.

* Nt is the number of unique documents that each term occurs in.
* TF is the total number of occurrences – some weighting models use this instead of Nt.
* The numbers in the @{} are a pointer – they tell Terrier where the postings are for that term in the inverted index data structure.
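The difference between Nt and TF can be illustrated with a small hand-made example. This is a minimal sketch in plain Python, assuming whitespace-tokenised toy documents; the variable names `doc_freq` and `total_freq` are illustrative and are not part of the Terrier API.

```python
from collections import Counter

# Toy documents, already preprocessed (illustrative tokens only).
docs = ["عرب طلاب عرب", "دوره عرب", "دوره يوم"]

doc_freq = Counter()    # Nt: number of documents that contain the term
total_freq = Counter()  # TF: total number of occurrences in the collection

for text in docs:
    tokens = text.split()
    total_freq.update(tokens)     # count every occurrence
    doc_freq.update(set(tokens))  # count each term once per document

for term in sorted(total_freq):
    print(f"{term}: Nt={doc_freq[term]} TF={total_freq[term]}")
```

Here the term "عرب" occurs twice in the first document and once in the second, so its TF is 3 while its Nt is 2.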


In [None]:
for kv in index.getLexicon():
  print("%s -> %s " % (kv.getKey(), kv.getValue().toString())) 

We can also look up a term in PyTerrier's lexicon:

In [None]:
index.getLexicon()["عرب"].toString()

**The inverted index** tells us which documents each term occurs in.
The LexiconEntry is the pointer that tells us where to find the postings for that term in the inverted index.

Let's find the documents in which the word "العرب" occurs and its frequency in each of them.

**Note:** we need to preprocess each search term with the same preprocessing steps we performed on the collection.
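The following toy sketch shows what goes wrong otherwise. It assumes a simplified pipeline: the same character mappings as the normalization step, plus a hypothetical `light_stem` helper that only strips a leading definite article and stands in for the real stemmer.

```python
# Same character mappings as the normalization step, as a translation table.
NORMALIZE = str.maketrans({
    "إ": "ا", "أ": "ا", "آ": "ا", "ٱ": "ا",
    "ى": "ي", "ؤ": "ء", "ئ": "ء", "ة": "ه",
})

def light_stem(term):
    """Hypothetical stand-in for a stemmer: strips a leading definite article."""
    return term[2:] if term.startswith("ال") and len(term) > 3 else term

def preprocess(text):
    return [light_stem(token) for token in text.translate(NORMALIZE).split()]

# The index vocabulary is built from preprocessed document text.
documents = {"d1": "الدورة باللغة العربية للطلاب العرب"}
index_terms = {term for text in documents.values() for term in preprocess(text)}

query = "العرب"
raw_hit = query in index_terms                       # raw surface form
processed_hit = preprocess(query)[0] in index_terms  # preprocessed form
print(raw_hit, processed_hit)
```

The raw query term is not in the vocabulary because the index only stores preprocessed forms, so the lookup succeeds only after the query goes through the same steps.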

In [None]:
#preprocess the search term
term="العرب"
print("the term before normalization and stemming:", term)
#normalize the word
term=normalize(term)
#stem the word
term=ar_stemmer.stemWord(term)
print("the term after normalization and stemming:", term)
#search the term
try:
    pointer = index.getLexicon()[term]
    for posting in index.getInvertedIndex().getPostings(pointer):
        print(posting.toString() + " doclen=%d" % posting.getDocumentLength())
except Exception:
    #the lookup fails when the term is not in the lexicon
    print("term %s not found" % term)

How many documents does the term "العرب" occur in?

In [None]:
index.getLexicon()[term].getDocumentFrequency()

What terms occur in the 4th document?

In [None]:
di = index.getDirectIndex()
doi = index.getDocumentIndex()
lex = index.getLexicon()
#docids are 0-based, so the 4th document has docid 3
#note: postings will be null if the document is empty
docid = 3
for posting in di.getPostings(doi.getDocumentEntry(docid)):
    termid = posting.getId()
    lee = lex.getLexiconEntry(termid)
    print("%s with frequency %d" % (lee.getKey(),posting.getFrequency()))

### **Indexing a bigger collection**
**[EveTAR](https://link.springer.com/article/10.1007/s10791-017-9325-7)** is the first freely available tweet test collection for multiple IR tasks. It includes a crawl of 355M Arabic tweets and covers 50 significant events, for which about 62K tweets were judged with substantial average inter-annotator agreement (Kappa value of 0.71).

First, we need to read the data from our GitHub repository. Note that we will use only a subset of 50K tweets in this lab.

In [None]:
dataset_links=["https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-01.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-02.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-03.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-04.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-05.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-06.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-07.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-08.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-09.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-10.txt"]

#read each file, then concatenate once (cheaper than growing a DataFrame inside the loop)
frames=[pd.read_csv(link, sep='\t') for link in tqdm(dataset_links)]
full_data=pd.concat(frames,ignore_index=True)
full_data


In [None]:

#the docno will be our tweetID
full_data["docno"]=full_data["tweetID"].astype(str)
full_data[["docno"]]  

We will perform the same processing steps described above. Because this collection consists of tweets, we first need to clean them (removing URLs, emojis, and so on) before applying the other steps.

In [None]:
#a function to clean the tweets
def clean(text):
   text = re.sub(r"http\S+", " ", text) # remove urls
   text = re.sub(r"RT ", " ", text) # remove rt
   text = re.sub(r"@[\w]*", " ", text) # remove handles
   text = re.sub(r"[\.\,\#_\|\:\?\?\/\=]", " ", text) # remove special characters
   text = re.sub(r'\t', ' ', text) # remove tabs
   text = re.sub(r'\n', ' ', text) # remove line jump
   text = re.sub(r"\s+", " ", text) # remove extra white space
   accents = re.compile(r'[\u064b-\u0652\u0640]') # harakat (diacritics) and tatweel (kashida) to remove
   arabic_punc= re.compile(r'[\u0621-\u063A\u0641-\u064A\d]+') # keep only runs of Arabic letters and digits
   text=' '.join(arabic_punc.findall(accents.sub('',text)))
   text = text.strip()
   return text


#we will clean each tweet in the collection
full_data["text"]=full_data["tweetText"].apply(clean)
print("***************************************************************************Tweets after cleaning*********************************************************************")
full_data[['docno','tweetText','text']]
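The effect of the main cleaning steps can be checked on a single made-up tweet. The sketch below is a condensed variant of the cleaning function, not a replacement for it: `clean_sketch` and the sample tweet are illustrative, and the diacritics and tatweel are written as escape sequences so that they are visible in the source.

```python
import re

DIACRITICS = re.compile(r"[\u064b-\u0652\u0640]")  # harakat and tatweel
ARABIC_OR_DIGITS = re.compile(r"[\u0621-\u063A\u0641-\u064A\d]+")

def clean_sketch(text):
    text = re.sub(r"http\S+", " ", text)  # drop URLs first
    text = re.sub(r"@\w*", " ", text)     # drop user handles
    text = DIACRITICS.sub("", text)       # strip diacritics and tatweel
    # keep only runs of Arabic letters and digits
    return " ".join(ARABIC_OR_DIGITS.findall(text))

# Made-up tweet: a retweet marker, a handle, diacritics, tatweel, a URL and a hashtag.
tweet = "RT @user123: خ\u064eب\u064eر عاج\u0640\u0640\u0640ل http://t.co/abc #قطر 2021"
cleaned = clean_sketch(tweet)
print(cleaned)
```

URLs and handles are removed before the final filter on purpose: otherwise digits embedded in them, such as the "123" in the handle, would survive as spurious tokens.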

We will remove the stop words.

In [None]:
full_data["text"]=full_data["text"].apply(remove_stopWords)
print("***************************************************************************Tweets after removing stopWords*********************************************************************")
full_data[['docno','tweetText','text']]

We also need to normalize the tweets.

In [None]:
#we will normalize using our normalize function. 
full_data["text"]=full_data["text"].apply(normalize)
print("***************************************************************************Tweets after normalizing*********************************************************************")
full_data[['docno','tweetText','text']]   

Finally, we stem the collection (this can take up to 2 minutes).

In [None]:
full_data['text']=full_data['text'].apply(stem)
print("***************************************************************************Tweets after stemming*********************************************************************")
full_data[['docno','tweetText','text']]   


### **Indexing EveTAR**



In [None]:
indexer = pt.DFIndexer("./evetarIndex", overwrite=True)
#the default tokeniser targets English, so we switch to the Unicode-aware "UTFTokeniser"
indexer.setProperty("tokeniser", "UTFTokeniser")
index_ref = indexer.index(full_data["text"], full_data["docno"])
index_ref.toString()

### **Explore the index**

In [None]:
#we will first load the index
index = pt.IndexFactory.of(index_ref)
#we will call getCollectionStatistics() to check the stats
print(index.getCollectionStatistics().toString())

Let's check the vocab in our index.

In [None]:
#check the vocab
for kv in index.getLexicon():
  print("%s -> %s " % (kv.getKey(), kv.getValue().toString())) 

### **Exercise1**
How many documents mention your country's name? Which documents are those?

### **Exercise2**
Select any document from the collection and check which of its terms appear in the index.


### **Exercise3**
How can we update our index to include the positions of the terms in the index? Hint: you can use [PyTerrier documentation](https://pyterrier.readthedocs.io/_/downloads/en/latest/pdf/) as a reference.

### **Exercise4**
Index an Arabic collection of your choice. You can use the Arabic datasets available at [Huggingface](https://huggingface.co/datasets?filter=languages:ar).

### **References**


* [Pandas DataFrames documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).
* IR From Bag-of-words to BERT and Beyond through Practical Experiments. [PyTerrier ECIR2021 Tutorial](https://github.com/terrier-org/ecir2021tutorial).
* [PyTerrier documentation](https://pyterrier.readthedocs.io/_/downloads/en/latest/pdf/).
* [Processing Arabic text in Python](https://alraqmiyyat.github.io/2013/01-02.html).

