<a href="https://colab.research.google.com/github/telsayed/IR-in-Arabic/blob/master/Summer2021/labs/day2/IR_in_Arabic_LabA_Indexing%26ExploringIndex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# **IR in Arabic** - Summer 2021 lab notebook 2


This is one of a series of Colab notebooks created for the **IR in Arabic** course. It demonstrates how to index a collection and how to access the resulting index to explore its contents and statistics.

The **learning outcomes** of this notebook are:


*   PyTerrier setup.
*   Preprocessing.
*   Indexing a collection.
*   Accessing and exploring the index.

What is PyTerrier?

**[PyTerrier](https://pyterrier.readthedocs.io/en/latest/)** is a Python framework that uses the underlying [Terrier information retrieval](http://terrier.org/) toolkit for many indexing and retrieval operations. While PyTerrier was new in 2020, Terrier is written in Java and has a long history dating back to 2001. PyTerrier makes it easy to run IR experiments in Python while relying on the mature Terrier platform for the expensive indexing and retrieval operations.


### **Setup**
We will first install PyTerrier as follows:

In [None]:
#install the Pyterrier framework
!pip install python-terrier



The next step is to initialise PyTerrier. This is performed using PyTerrier's init() method. The init() method is needed as PyTerrier must download Terrier's jar file and start the Java virtual machine. We prevent init() from being called more than once by checking started().

In [None]:
import pyterrier as pt
if not pt.started():
  pt.init()

Another library that we need for this lab is Arabic-Stopwords.

In [None]:
#install the Arabic stop words library
!pip install Arabic-Stopwords



We will import all the Python libraries needed for this lab.

In [None]:
#we need to import the following libraries.
import pandas as pd
#to display the full text on the notebook without truncation
pd.set_option('display.max_colwidth', 150)
import re
from snowballstemmer import stemmer
from tqdm import tqdm
import arabicstopwords.arabicstopwords as stp

### **Data preparation**
We will first create five textual documents.

In [None]:
docs_df = pd.DataFrame([ ["d0", "هذا هو اليوم الأول من دورة استرجاع المعلومات"],
                        ["d1", "الدورة باللغة العربية للطلاب العرب"], 
                        ["d2", "اليوم هو 30 مايو 2021"], 
                        ["d3", "نأمل أن تفيد هذه الدورة الطلاب العرب"],
                        ["d4", "هل أنتم سعداء بهذه التجربة؟"] ],
                        columns=["docno", "text"])

docs_df

Unnamed: 0,docno,text
0,d0,هذا هو اليوم الأول من دورة استرجاع المعلومات
1,d1,الدورة باللغة العربية للطلاب العرب
2,d2,اليوم هو 30 مايو 2021
3,d3,نأمل أن تفيد هذه الدورة الطلاب العرب
4,d4,هل أنتم سعداء بهذه التجربة؟


Before indexing our data we need to do the following processing steps:


1.   **Remove stopwords.**
2.   **Normalization.**
3.   **Stemming.**




Let's remove the stopwords.

In [None]:
#build the stop-word set once, instead of rebuilding it on every call
stopWords = set(stp.stopwords_list())

#a function to remove the stop words from a sentence
def RemoveStopWords(sentence):
    terms = []
    for term in sentence.split():
        if term not in stopWords:
            terms.append(term)
    return " ".join(terms)

docs_df["text"] = docs_df["text"].apply(RemoveStopWords)
docs_df

Unnamed: 0,docno,text
0,d0,اليوم الأول دورة استرجاع المعلومات
1,d1,الدورة باللغة العربية للطلاب العرب
2,d2,اليوم 30 مايو 2021
3,d3,نأمل تفيد الدورة الطلاب العرب
4,d4,سعداء التجربة؟


After removing the stopwords the next step is to normalize our documents.

In [None]:
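#a function to normalize Arabic text (used for the documents here and for the tweets later)
def normalize(text):
    text = re.sub("[إأٱآا]", "ا", text) #unify the alef variants
    text = re.sub("ى", "ي", text) #alef maqsura -> yaa
    text = re.sub("ؤ", "ء", text) #waw with hamza -> hamza
    text = re.sub("ئ", "ء", text) #yaa with hamza -> hamza
    text = re.sub("ة", "ه", text) #taa marbuta -> haa
    return text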

docs_df["text"]=docs_df["text"].apply(normalize)
print("***************************************************************************documents after normalizing*********************************************************************")
print(docs_df[['docno','text']])   

***************************************************************************documents after normalizing*********************************************************************
  docno                                text
0    d0  اليوم الاول دوره استرجاع المعلومات
1    d1  الدوره باللغه العربيه للطلاب العرب
2    d2                  اليوم 30 مايو 2021
3    d3       نامل تفيد الدوره الطلاب العرب
4    d4                      سعداء التجربه؟
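
The effect of normalization is easiest to see on spelling variants of one word. The sketch below repeats the notebook's substitutions in a self-contained form and checks that three hypothetical variants (chosen only for illustration) collapse to a single string, which is what lets a query match a document regardless of how the alef was typed.

```python
import re

def normalize(text):
    # Same substitutions as the notebook's normalize() function.
    text = re.sub("[إأٱآا]", "ا", text)  # unify the alef variants
    text = re.sub("ى", "ي", text)        # alef maqsura -> yaa
    text = re.sub("ؤ", "ء", text)        # waw with hamza -> hamza
    text = re.sub("ئ", "ء", text)        # yaa with hamza -> hamza
    text = re.sub("ة", "ه", text)        # taa marbuta -> haa
    return text

# Illustrative spelling variants of the same name (assumed examples).
variants = ["أحمد", "احمد", "إحمد"]
normalized = {normalize(v) for v in variants}
print(normalized)  # a set with a single element
```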


The last processing step is to stem the terms in each document.

In [None]:
ar_stemmer = stemmer("arabic")
docs_text=docs_df['text'].tolist()
output = []
for text in tqdm(docs_text):
    output.append(" ".join([ar_stemmer.stemWord(i) for i in text.split()]))
docs_df['text']=output
docs_df


100%|██████████| 5/5 [00:00<00:00, 225.82it/s]


Unnamed: 0,docno,text
0,d0,يوم اول دور استرجاع معلوم
1,d1,دور لغه عرب طلاب عرب
2,d2,يوم 30 مايو 2021
3,d3,نامل تفيد دور طلاب عرب
4,d4,سعداء تجربه؟
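
The three steps above can be bundled into one function, so that exactly the same processing is applied to documents and, later, to query terms. This is only a sketch: `preprocess`, `toy_stop_words`, and the identity default for `stem` are names invented here. In the notebook one would presumably pass the full stop-word set and `ar_stemmer.stemWord` instead.

```python
import re

def normalize(text):
    # Compact equivalent of the notebook's normalize() function.
    text = re.sub("[إأٱآا]", "ا", text)
    for src, dst in (("ى", "ي"), ("ؤ", "ء"), ("ئ", "ء"), ("ة", "ه")):
        text = text.replace(src, dst)
    return text

def preprocess(text, stop_words, stem=lambda token: token):
    """Stop-word removal, then normalization, then stemming (the notebook's order)."""
    kept = [t for t in text.split() if t not in stop_words]
    normalized = normalize(" ".join(kept))
    return " ".join(stem(t) for t in normalized.split())

# A tiny hypothetical stop-word set, just enough for this example.
toy_stop_words = {"هذا", "هو", "من"}
result = preprocess("هذا هو اليوم الأول من دورة استرجاع المعلومات", toy_stop_words)
print(result)
```

With the identity stemmer, the result equals the normalized form of document d0 shown earlier.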


Next, we will index the dataframe's documents. The index, with all its data structures, is saved into a directory called **myFirstIndex**.

In [None]:
indexer = pt.DFIndexer("./myFirstIndex", overwrite=True)
#the default tokeniser targets English, so we switch to the non-English tokeniser "UTFTokeniser"
indexer.setProperty("tokeniser", "UTFTokeniser")
index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index_ref.toString()

'./myFirstIndex/data.properties'

### **Explore the index**
An index has several data structures:

*   **the CollectionStatistics** - the salient global statistics of the index.
*   **the Lexicon** - the vocabulary of the index, including statistics of the terms, and a pointer into the inverted index.
*   **the inverted index (a PostingIndex)** - contains the posting list for each term, detailing the documents in which the term appears and its frequency in each.
*   **the DocumentIndex** - contains the length of each document (and other field lengths).
*   **the MetaIndex** - contains document metadata, such as the docno, and optionally the raw text and the URL of each document.
*   **the direct index (also a PostingIndex)** - contains a posting list for each document, detailing which terms occur in that document and with which frequency. The presence of the direct index depends on the IndexingType that has been applied - single-pass and some memory indices do not provide a direct index.
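
To make these structures concrete, here is a minimal in-memory sketch built from the five stemmed toy documents shown earlier. It is an illustration only: Terrier keeps these structures in compressed on-disk files, and the names `inverted`, `direct`, and `doc_lengths` are made up for this sketch. The regular expression `\w+` is used as a rough stand-in for the tokeniser.

```python
import re
from collections import Counter, defaultdict

# docid -> stemmed text, copied from the toy collection above
docs = {
    0: "يوم اول دور استرجاع معلوم",
    1: "دور لغه عرب طلاب عرب",
    2: "يوم 30 مايو 2021",
    3: "نامل تفيد دور طلاب عرب",
    4: "سعداء تجربه؟",
}

inverted = defaultdict(list)  # term  -> [(docid, tf), ...]   (inverted index)
direct = {}                   # docid -> {term: tf}           (direct index)
doc_lengths = {}              # docid -> number of tokens     (document index)

for docid, text in docs.items():
    tokens = re.findall(r"\w+", text)  # drops punctuation such as the question mark
    counts = Counter(tokens)
    direct[docid] = dict(counts)
    doc_lengths[docid] = len(tokens)
    for term, tf in counts.items():
        inverted[term].append((docid, tf))

print(inverted["عرب"])  # [(1, 2), (3, 1)]
print(direct[3])
print(doc_lengths)
```

The set of keys of `inverted` plays the role of the lexicon: looking up a term gives access to its posting list.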


In [None]:
# !rm myFirstIndex/*
!ls -lh myFirstIndex/

total 40K
-rw-r--r-- 1 root root   10 May 25 10:52 data.direct.bf
-rw-r--r-- 1 root root   85 May 25 10:52 data.document.fsarrayfile
-rw-r--r-- 1 root root   11 May 25 10:52 data.inverted.bf
-rw-r--r-- 1 root root 1.3K May 25 10:52 data.lexicon.fsomapfile
-rw-r--r-- 1 root root  441 May 25 10:52 data.lexicon.fsomaphash
-rw-r--r-- 1 root root   60 May 25 10:52 data.lexicon.fsomapid
-rw-r--r-- 1 root root   40 May 25 10:52 data.meta.idx
-rw-r--r-- 1 root root   80 May 25 10:52 data.meta.zdata
-rw-r--r-- 1 root root 4.1K May 25 10:52 data.properties


Let's check the statistics about the index we created.

In [None]:
#we will first load the index
index = pt.IndexFactory.of(index_ref)
#we will call getCollectionStatistics() to check the stats
print(index.getCollectionStatistics().toString())

Number of documents: 5
Number of terms: 15
Number of postings: 20
Number of fields: 0
Number of tokens: 21
Field names: []
Positions:   false
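
These numbers can be reproduced by hand from the stemmed toy documents, which helps to tell the counts apart: tokens count every occurrence, postings count each (term, document) pair once, and terms count each distinct word once. The sketch below assumes `\w+` tokenisation as an approximation of the tokeniser, and `stemmed_docs` is a name chosen here.

```python
import re
from collections import Counter

stemmed_docs = [
    "يوم اول دور استرجاع معلوم",
    "دور لغه عرب طلاب عرب",
    "يوم 30 مايو 2021",
    "نامل تفيد دور طلاب عرب",
    "سعداء تجربه؟",
]

# One term-frequency counter per document.
per_doc = [Counter(re.findall(r"\w+", text)) for text in stemmed_docs]

num_docs = len(per_doc)
num_tokens = sum(sum(c.values()) for c in per_doc)  # every occurrence
num_postings = sum(len(c) for c in per_doc)         # distinct terms per document
num_terms = len(set().union(*per_doc))              # distinct terms overall

print(num_docs, num_terms, num_postings, num_tokens)  # 5 15 20 21
```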



We can check the lexicon which is the **vocabulary** of the collection.

In [None]:
for kv in index.getLexicon():
  print("%s -> %s " % (kv.getKey(), kv.getValue().toString())) 

2021 -> term10 Nt=1 TF=1 maxTF=1 @{0 0 0} 
30 -> term8 Nt=1 TF=1 maxTF=1 @{0 0 4} 
استرجاع -> term0 Nt=1 TF=1 maxTF=1 @{0 1 0} 
اول -> term1 Nt=1 TF=1 maxTF=1 @{0 1 2} 
تجربه -> term14 Nt=1 TF=1 maxTF=1 @{0 1 4} 
تفيد -> term12 Nt=1 TF=1 maxTF=1 @{0 2 2} 
دور -> term2 Nt=3 TF=3 maxTF=1 @{0 3 0} 
سعداء -> term13 Nt=1 TF=1 maxTF=1 @{0 4 0} 
طلاب -> term6 Nt=2 TF=2 maxTF=1 @{0 4 6} 
عرب -> term7 Nt=2 TF=3 maxTF=2 @{0 5 6} 
لغه -> term5 Nt=1 TF=1 maxTF=1 @{0 6 7} 
مايو -> term9 Nt=1 TF=1 maxTF=1 @{0 7 3} 
معلوم -> term4 Nt=1 TF=1 maxTF=1 @{0 7 7} 
نامل -> term11 Nt=1 TF=1 maxTF=1 @{0 8 1} 
يوم -> term3 Nt=2 TF=2 maxTF=1 @{0 8 7} 
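
In each lexicon line, `Nt` is the number of documents containing the term, `TF` is its total number of occurrences in the collection, and `maxTF` is its highest frequency within a single document. As a sanity check, the sketch below recomputes the three values from the stemmed toy documents. `lexicon_stats` is a helper invented for this illustration, not a PyTerrier function.

```python
import re
from collections import Counter

stemmed_docs = [
    "يوم اول دور استرجاع معلوم",
    "دور لغه عرب طلاب عرب",
    "يوم 30 مايو 2021",
    "نامل تفيد دور طلاب عرب",
    "سعداء تجربه؟",
]

def lexicon_stats(term):
    """Recompute Nt, TF and maxTF for one term from the raw documents."""
    tfs = [Counter(re.findall(r"\w+", doc))[term] for doc in stemmed_docs]
    tfs = [tf for tf in tfs if tf > 0]  # keep only documents containing the term
    return {"Nt": len(tfs), "TF": sum(tfs), "maxTF": max(tfs, default=0)}

print(lexicon_stats("عرب"))  # {'Nt': 2, 'TF': 3, 'maxTF': 2}
print(lexicon_stats("دور"))  # {'Nt': 3, 'TF': 3, 'maxTF': 1}
```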


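The inverted index tells us in which documents each term occurs.
The LexiconEntry is the pointer that tells us where to find the postings for that term in the inverted index.

Let's find the documents in which the word "العرب" occurs, and its frequency in each document. Because the index stores normalized and stemmed terms, we apply the same normalization and stemming to the word before looking it up.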

In [None]:
term="العرب"
term=normalize(term)
term=ar_stemmer.stemWord(term)
print("the term after normalization and stemming:", term)
pointer = index.getLexicon()[term]
for posting in index.getInvertedIndex().getPostings(pointer):
    print(posting.toString() + " doclen=%d" % posting.getDocumentLength())

the term after normalization and stemming: عرب
ID(1) TF(2) doclen=5
ID(3) TF(1) doclen=5


How many documents does the term "العرب" occur in?

In [None]:
index.getLexicon()[term].getDocumentFrequency()

2

What terms occur in the 4th document?

In [None]:
di = index.getDirectIndex()
doi = index.getDocumentIndex()
lex = index.getLexicon()
docid = 3 #docids are 0-based #note: postings will be null if the document is empty
for posting in di.getPostings(doi.getDocumentEntry(docid)):
    termid = posting.getId()
    lee = lex.getLexiconEntry(termid)
    print("%s with frequency %d" % (lee.getKey(),posting.getFrequency()))

دور with frequency 1
طلاب with frequency 1
عرب with frequency 1
نامل with frequency 1
تفيد with frequency 1


### **Indexing a bigger collection**
**[EveTAR](https://link.springer.com/article/10.1007/s10791-017-9325-7)** is the first freely-available tweet test collection for multiple IR tasks. EveTAR includes a crawl of 355M Arabic tweets and covers 50 significant events for which about 62K tweets were judged with substantial average inter-annotator agreement (Kappa value of 0.71).

First, we need to read the data from our GitHub repository. Note that we will use only a subset of 50K tweets in this lab.

In [None]:
dataset_links=["https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-01.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-02.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-03.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-04.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-05.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-06.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-07.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-08.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-09.txt",
               "https://raw.githubusercontent.com/telsayed/IR-in-Arabic/master/Summer2021/data/EveTAR/tweets/evetar-q-10.txt"]

full_data=pd.DataFrame()
for i in tqdm(range(len(dataset_links))):
    tweets=pd.read_csv(dataset_links[i], sep='\t')
    full_data=pd.concat([full_data,tweets],ignore_index=True)
full_data.reset_index(inplace=True,drop=True)
#the docno will be our tweetID
full_data["docno"]=full_data["tweetID"].astype(str)
full_data  

100%|██████████| 10/10 [00:00<00:00, 10.13it/s]


Unnamed: 0,tweetID,tweetText,docno
0,549679192804061184,"الاعدام لعامل مطعم قتل زميله طعناً في ""البيادر"" أيدت محكمة التمييز الحكم الصادر عن محكمة الجنايات الكبرى والقاضي... http://t.co/H0txdjv3Kn",549679192804061184
1,549699343666532352,#الأخبار ▪ تأجيل محاكمة 7 إرهابيين بسبب غياب الدفاع: أجلت محكمة الجنايات بالعاصمة إلى تاريخ لاحق محاكمة سبعة إ... http://t.co/GM4jmpAWbR,549699343666532352
2,549711593487888387,@helale9999 عشآن أعطيتك وحده صميم صرت ترمي أعذار ...حقق العالميةة و أرجع كلمني يَ الأياب الانتحاري,549711593487888387
3,549719610459967488,#النهدي ثمانية قتلى في تفجير انتحاري بسيارة مفخخة أمام معملين للغاز في ريف حمص - شبكة الصين http://t.co/r5zFEuzAPu,549719610459967488
4,549720880717508608,البحرين: ضبط مطلوبين متورطين في التفجير بالعكر الشرقي بقية الموضوع اضغط هنا http://t.co/t4A5bNrqyh,549720880717508608
...,...,...,...
49995,561985373048299520,مواسيا الشعب السعودي..حاكم دبي يبدأ جلسة مجلس الوزراء بقراءة الفاتحة على الملك عبدالله #الخبر #السعودية #saudi #ksa,561985373048299520
49996,561987332878766081,@al_shalal @F_D_A82 تم تفجير صماخنا,561987332878766081
49997,561988825186971650,@aubyazid123 جزاك الله ألف خير ❌ جزاك الله خير ✔️ - كلمة ألف فيها تحجير لخير الله.,561988825186971650
49998,561991173360091136,كيف نفّذت «النصرة» عمليّة تفجير الحافلة اللبنانية في دمشق؟ http://t.co/TEmP1Dso1v,561991173360091136


We will perform the same processing steps as above. However, because we are indexing a collection of tweets, we first need to clean them (remove URLs, emojis, etc.) before applying the other processing steps.

In [None]:
#a function to clean the tweets
def clean(text):
    text = re.sub(r"http\S+", " ", text) # remove urls
    text = re.sub(r"RT ", " ", text) # remove rt
    text = re.sub(r"@[\w]*", " ", text) # remove handles
    text = re.sub(r"[\.\,\#_\|\:\?\?\/\=]", " ", text) # remove special characters
    text = re.sub(r'\t', ' ', text) # remove tabs
    text = re.sub(r'\n', ' ', text) # remove newlines
    text = re.sub(r"\s+", " ", text) # remove extra white space
    accents = re.compile(r'[\u064b-\u0652\u0640]') # harakaat and tatweel (kashida) to remove
    # keep only runs of Arabic letters and digits (a "+" inside the brackets would also keep literal plus signs)
    arabic_tokens = re.compile(r'[\u0621-\u063A\u0641-\u064A\d]+')
    text = ' '.join(arabic_tokens.findall(accents.sub('', text)))
    text = text.strip()
    return text
#we will clean each tweet in the collection
print("***************************************************************************Tweets before cleaning*********************************************************************")
print(full_data[['docno','tweetText']])
full_data["tweetText"]=full_data["tweetText"].apply(clean)
print("***************************************************************************Tweets after cleaning*********************************************************************")
print(full_data[['docno','tweetText']])

***************************************************************************Tweets before cleaning*********************************************************************
                    docno                                                                                                                                   tweetText
0      549679192804061184  الاعدام لعامل مطعم قتل زميله طعناً في "البيادر" أيدت محكمة التمييز الحكم الصادر عن محكمة الجنايات الكبرى والقاضي... http://t.co/H0txdjv3Kn
1      549699343666532352    #الأخبار ▪ تأجيل محاكمة 7 إرهابيين بسبب غياب الدفاع: أجلت محكمة الجنايات بالعاصمة إلى تاريخ لاحق محاكمة سبعة إ... http://t.co/GM4jmpAWbR
2      549711593487888387                                          @helale9999 عشآن أعطيتك وحده صميم صرت ترمي أعذار ...حقق العالميةة و أرجع كلمني يَ الأياب الانتحاري
3      549719610459967488                          #النهدي ثمانية قتلى في تفجير انتحاري بسيارة مفخخة أمام معملين للغاز في ريف حمص - شبكة الصين http://t.co/r5zFEuzAPu
4  
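
The least obvious part of the cleaning function is the `accents` pattern. The short sketch below applies the same character class on its own: the range U+064B to U+0652 covers the harakat (fathatan through sukun, including the shadda), and U+0640 is the tatweel used to stretch words. The sample words are assumptions chosen for illustration.

```python
import re

# Same character class as the `accents` pattern in clean().
accents = re.compile(r"[\u064b-\u0652\u0640]")

samples = ["طعناً", "نفّذت", "جمـــيل"]  # tanween, shadda, tatweel
stripped = [accents.sub("", s) for s in samples]
print(stripped)
```

After this step, differently vocalized or stretched spellings of a word reduce to the same bare letters.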

We will remove the stop words.

In [None]:
full_data["tweetText"]=full_data["tweetText"].apply(RemoveStopWords)
print("***************************************************************************Tweets after removing stopWords*********************************************************************")
print(full_data[['docno','tweetText']])

***************************************************************************Tweets after removing stopWords*********************************************************************
                    docno                                                                                                tweetText
0      549679192804061184  الاعدام لعامل مطعم قتل زميله طعنا البيادر أيدت محكمة التمييز الحكم الصادر محكمة الجنايات الكبرى والقاضي
1      549699343666532352   الأخبار تأجيل محاكمة 7 إرهابيين بسبب غياب الدفاع أجلت محكمة الجنايات بالعاصمة تاريخ لاحق محاكمة سبعة إ
2      549711593487888387                           عشآن أعطيتك وحده صميم صرت ترمي أعذار حقق العالميةة أرجع كلمني الأياب الانتحاري
3      549719610459967488                            النهدي ثمانية قتلى تفجير انتحاري بسيارة مفخخة معملين للغاز ريف حمص شبكة الصين
4      549720880717508608                                      البحرين ضبط مطلوبين متورطين التفجير بالعكر الشرقي بقية الموضوع اضغط
...                   ...             

We also need to normalize the tweets.

In [None]:
#we will normalize using our normalize function. 
full_data["tweetText"]=full_data["tweetText"].apply(normalize)
print("***************************************************************************Tweets after normalizing*********************************************************************")
print(full_data[['docno','tweetText']])   

***************************************************************************Tweets after normalizing*********************************************************************
                    docno                                                                                                tweetText
0      549679192804061184  الاعدام لعامل مطعم قتل زميله طعنا البيادر ايدت محكمه التمييز الحكم الصادر محكمه الجنايات الكبري والقاضي
1      549699343666532352   الاخبار تاجيل محاكمه 7 ارهابيين بسبب غياب الدفاع اجلت محكمه الجنايات بالعاصمه تاريخ لاحق محاكمه سبعه ا
2      549711593487888387                           عشان اعطيتك وحده صميم صرت ترمي اعذار حقق العالميهه ارجع كلمني الاياب الانتحاري
3      549719610459967488                            النهدي ثمانيه قتلي تفجير انتحاري بسياره مفخخه معملين للغاز ريف حمص شبكه الصين
4      549720880717508608                                      البحرين ضبط مطلوبين متورطين التفجير بالعكر الشرقي بقيه الموضوع اضغط
...                   ...                    

Next, we stem the collection (this can take up to 2 minutes).

In [None]:
tweets_text=full_data['tweetText'].tolist()
output = []
for tweet in tqdm(tweets_text):
    output.append(" ".join([ar_stemmer.stemWord(i) for i in tweet.split()]))
full_data['tweetText']=output


100%|██████████| 50000/50000 [01:39<00:00, 501.83it/s]
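
Since tweets repeat many words, one possible way to shorten this step is to cache stems so that each distinct token is stemmed only once. The sketch below shows the idea with `functools.lru_cache`. `toy_stem` is a hypothetical stand-in for `ar_stemmer.stemWord` (it merely strips a leading "ال"), and whether caching actually helps depends on the stemmer implementation.

```python
from functools import lru_cache

call_log = []  # records every real call to the stemmer

def toy_stem(token):
    """Hypothetical stand-in for ar_stemmer.stemWord: strips a leading 'ال'."""
    call_log.append(token)
    return token[2:] if token.startswith("ال") and len(token) > 3 else token

# Wrap the stemmer so that repeated tokens are answered from the cache.
cached_stem = lru_cache(maxsize=None)(toy_stem)

tweets = ["التفجير الانتحاري", "التفجير الكبير", "خبر التفجير"]
stemmed = [" ".join(cached_stem(t) for t in tweet.split()) for tweet in tweets]

print(stemmed)
print(len(call_log))  # 4 real calls for 6 tokens
```

In the notebook, the same wrapper could presumably be applied to `ar_stemmer.stemWord` in place of `toy_stem`.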


In [None]:
#the tweets after stemming
print(full_data[['docno','tweetText']])   

                    docno                                                                         tweetText
0      549679192804061184  اعدام لعامل مطعم قتل زميل طعنا بيادر ايد محكم تمييز حكم صادر محكم جنا كبر والقاض
1      549699343666532352    اخبار تاجيل محا 7 ارهابي سبب غياب دفاع اجل محكم جنا عاصمه تاريخ لاحق محا سبع ا
2      549711593487888387                 عشان اعطي وحد صميم صرت ترم اعذار حقق عالميهه ارجع كلم اياب انتحار
3      549719610459967488                       نهد ثما قتل تفجير انتحار سيار مفخخ معمل غاز ريف حمص شبك صين
4      549720880717508608                                  بحر ضبط مطلوب متورط تفجير عكر شرق بقي موضوع اضغط
...                   ...                                                                               ...
49995  561985373048299520        مواسي شعب سعود حاكم دب يبد جلس مجلس وزراء قراء فاتحه ملك عبدالل خبر سعوديه
49996  561987332878766081                                                                     تم تفجير صماخ
49997  561988825186971650   

### **Indexing EveTAR**



In [None]:
indexer = pt.DFIndexer("./evetarIndex", overwrite=True)
#the default tokeniser targets English, so we switch to the non-English tokeniser "UTFTokeniser"
indexer.setProperty("tokeniser", "UTFTokeniser")
index_ref = indexer.index(full_data["tweetText"], full_data["docno"])
index_ref.toString()

'./evetarIndex/data.properties'

### **Explore the index**

In [None]:
#we will first load the index
index = pt.IndexFactory.of(index_ref)
#we will call getCollectionStatistics() to check the stats
print(index.getCollectionStatistics().toString())

Number of documents: 50000
Number of terms: 25045
Number of postings: 499710
Number of fields: 0
Number of tokens: 537706
Field names: []
Positions:   false
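
Two derived figures help to interpret these statistics: the average tweet length after preprocessing, and the average number of documents per term. The sketch below simply redoes the arithmetic with the numbers printed above; the variable names are chosen here.

```python
# Values copied from the collection statistics printed above.
num_docs, num_terms = 50_000, 25_045
num_postings, num_tokens = 499_710, 537_706

avg_doc_length = num_tokens / num_docs            # tokens per tweet
avg_postings_per_term = num_postings / num_terms  # mean document frequency of a term

print(round(avg_doc_length, 2), round(avg_postings_per_term, 2))  # 10.75 19.95
```

So a cleaned tweet holds about 11 tokens on average, and a term occurs in about 20 tweets on average.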



**A note to the tester:** Please check the vocabulary below. Is و a stop word? Check words such as و اللاعب and و الليل. Also, in words such as هالاسبوع, should the prefix ها not be removed during stemming?

In [None]:
#check the vocab
for kv in index.getLexicon():
  print("%s -> %s " % (kv.getKey(), kv.getValue().toString())) 

Streaming output truncated to the last 5000 lines.
نطلق -> term13464 Nt=3 TF=3 maxTF=1 @{0 593081 4} 
نطم -> term19640 Nt=2 TF=2 maxTF=1 @{0 593092 0} 
نطو -> term16197 Nt=1 TF=1 maxTF=1 @{0 593099 4} 
نطيط -> term24695 Nt=3 TF=3 maxTF=1 @{0 593103 2} 
نظ -> term18812 Nt=1 TF=1 maxTF=1 @{0 593111 6} 
نظاف -> term11993 Nt=5 TF=5 maxTF=1 @{0 593115 4} 
نظافه -> term22054 Nt=1 TF=1 maxTF=1 @{0 593128 0} 
نظاكم -> term22460 Nt=1 TF=1 maxTF=1 @{0 593132 0} 
نظال -> term21681 Nt=1 TF=1 maxTF=1 @{0 593136 0} 
نظام -> term1494 Nt=127 TF=132 maxTF=2 @{0 593140 0} 
نظر -> term284 Nt=79 TF=81 maxTF=2 @{0 593379 5} 
نظرا -> term16981 Nt=3 TF=3 maxTF=1 @{0 593533 1} 
نظرت -> term19771 Nt=1 TF=1 maxTF=1 @{0 593541 5} 
نظف -> term21784 Nt=1 TF=1 maxTF=1 @{0 593545 5} 
نظل -> term21909 Nt=1 TF=1 maxTF=1 @{0 593549 5} 
نظم -> term1053 Nt=15 TF=15 maxTF=1 @{0 593553 5} 
نظير -> term12355 Nt=51 TF=51 maxTF=1 @{0 593589 1} 
نظيف -> term13930 Nt=21 TF=21 maxTF=1 @{0 593672 3} 
نع -> term19481

### **Exercise 1**
How many documents mention your country name? Which documents are those?

In [None]:
term="الجزائر"
term=normalize(term)
term=ar_stemmer.stemWord(term)
print("The number of documents that mention %s is %s"%(term,index.getLexicon()[term].getDocumentFrequency()))


The number of documents that mention جزاءر is 337


In [None]:
pointer = index.getLexicon()[term]
for posting in index.getInvertedIndex().getPostings(pointer):
    print(posting.toString() + " doclen=%d" % posting.getDocumentLength())

ID(608) TF(2) doclen=14
ID(761) TF(1) doclen=12
ID(1406) TF(1) doclen=9
ID(2230) TF(1) doclen=11
ID(3362) TF(1) doclen=12
ID(4501) TF(1) doclen=19
ID(5047) TF(1) doclen=7
ID(5186) TF(1) doclen=13
ID(5187) TF(1) doclen=13
ID(5188) TF(1) doclen=13
ID(5189) TF(1) doclen=13
ID(5190) TF(1) doclen=13
ID(5389) TF(1) doclen=13
ID(5390) TF(1) doclen=13
ID(5432) TF(1) doclen=13
ID(5433) TF(1) doclen=13
ID(5434) TF(1) doclen=13
ID(5435) TF(1) doclen=13
ID(5551) TF(1) doclen=13
ID(5553) TF(1) doclen=13
ID(5554) TF(1) doclen=13
ID(5570) TF(1) doclen=17
ID(5773) TF(1) doclen=18
ID(6425) TF(1) doclen=15
ID(6426) TF(1) doclen=15
ID(6427) TF(1) doclen=15
ID(6428) TF(1) doclen=15
ID(6429) TF(1) doclen=15
ID(6604) TF(1) doclen=16
ID(6605) TF(1) doclen=16
ID(6607) TF(1) doclen=16
ID(6608) TF(1) doclen=16
ID(6687) TF(1) doclen=15
ID(6688) TF(1) doclen=15
ID(8514) TF(1) doclen=15
ID(8515) TF(1) doclen=15
ID(8672) TF(1) doclen=15
ID(8674) TF(1) doclen=15
ID(8742) TF(1) doclen=15
ID(8743) TF(1) doclen=15
ID(8
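
The postings above list internal docids, not tweet IDs. To answer "which documents are those?" in terms of the original tweets, the docids can be mapped back to their docno through the MetaIndex. The helper below is a sketch under the assumption that `index` is the index loaded earlier in this notebook and that `term` has already been normalized and stemmed. `docnos_for_term` is a name invented here.

```python
def docnos_for_term(index, term):
    """Return (docno, tf) pairs for a term that is already normalized and stemmed."""
    pointer = index.getLexicon()[term]  # lexicon entry pointing at the posting list
    meta = index.getMetaIndex()         # maps internal docids to metadata such as docno
    return [
        (meta.getItem("docno", posting.getId()), posting.getFrequency())
        for posting in index.getInvertedIndex().getPostings(pointer)
    ]
```

For example, `docnos_for_term(index, term)[:5]` would list the tweet IDs of the first five matching tweets together with the term frequency in each.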

### **Exercise 2**
Select any document from the collection and check which of its terms appear in the index.


In [None]:
#let's say we want to check the document with docid 100 (the 101st document, as docids are 0-based)
di = index.getDirectIndex()
doi = index.getDocumentIndex()
lex = index.getLexicon()
docid = 100 #docids are 0-based #note: postings will be null if the document is empty
for posting in di.getPostings(doi.getDocumentEntry(docid)):
    termid = posting.getId()
    lee = lex.getLexiconEntry(termid)
    print("%s with frequency %d" % (lee.getKey(),posting.getFrequency()))

اخبار with frequency 1
تفجير with frequency 1
استقبل with frequency 1
جريح with frequency 1
نواب with frequency 1
طبرق with frequency 2
استهدف with frequency 1
مجلس with frequency 1
مقر with frequency 1
جراء with frequency 1
مركز with frequency 1
19 with frequency 1
دردنيل with frequency 1
الطب with frequency 1


In [None]:
#temporary: export the index to the local machine so that we can keep a copy in our GitHub repository
!ls -lh evetarIndex/
#zip the index directory, then download the archive
from google.colab import files
!zip -r ./evetarIndex.zip ./evetarIndex
files.download("evetarIndex.zip")

total 6.7M
-rw-r--r-- 1 root root 1.1M May 25 10:58 data.direct.bf
-rw-r--r-- 1 root root 831K May 25 10:58 data.document.fsarrayfile
-rw-r--r-- 1 root root 671K May 25 10:58 data.inverted.bf
-rw-r--r-- 1 root root 2.1M May 25 10:58 data.lexicon.fsomapfile
-rw-r--r-- 1 root root 1.1K May 25 10:58 data.lexicon.fsomaphash
-rw-r--r-- 1 root root  98K May 25 10:58 data.lexicon.fsomapid
-rw-r--r-- 1 root root 391K May 25 10:58 data.meta.idx
-rw-r--r-- 1 root root 1.7M May 25 10:58 data.meta.zdata
-rw-r--r-- 1 root root 4.1K May 25 10:58 data.properties



### **References**


*   IR From Bag-of-words to BERT and Beyond through Practical Experiments. [PyTerrier ECIR2021 Tutorial](https://github.com/terrier-org/ecir2021tutorial).
*   [PyTerrier documentation.](https://pyterrier.readthedocs.io/_/downloads/en/latest/pdf/)

