# Indexing and Search


In this notebook we try some indexing and search techniques. They can serve as a first approach to the question answering problem, with the questions used as queries. The limitation of this technique is that it cannot answer a question; it can only retrieve the contexts most similar to it. It also ignores the semantics of a question and relies only on its TF-IDF-like vectorization.

## Dataset import and tool installation

In [1]:
import os
from google.colab import drive

drive.mount('/content/drive')
os.chdir('/content/drive/MyDrive/')

os.getcwd()

Mounted at /content/drive


'/content/drive/MyDrive'

In [2]:
!pip install -q python-terrier


In [3]:
import pyterrier as pt
if not pt.started():
  pt.init()

terrier-assemblies 5.7 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.7 jar not found, downloading to /root/.pyterrier...
Done


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



In [4]:
import pandas as pd
import json

In [5]:
data_name = "train-v2.0.json"
# Use a context manager so the file is closed after parsing
with open(data_name, 'r', encoding='utf-8') as data_file:
    data_json = json.load(data_file)

dataset = data_json['data']

In the following cell the dataset is prepared according to the input requirements of the python-terrier library: one row per context, with a docno identifier and a text column.

In [6]:
contexts = [paragraph['context'] for sample in dataset for paragraph in sample['paragraphs'] ]

contexts_df = pd.DataFrame(
    [
        ['d'+str(i+1), context] for i, context in enumerate(contexts)
    ],
    columns=["docno", "text"]
)
contexts_df.head()

Unnamed: 0,docno,text
0,d1,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...
1,d2,Following the disbandment of Destiny's Child i...
2,d3,"A self-described ""modern-day feminist"", Beyonc..."
3,d4,"Beyoncé Giselle Knowles was born in Houston, T..."
4,d5,Beyoncé attended St. Mary's Elementary School ...
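
To make the flattening step easier to follow, here is a minimal sketch on a tiny, made-up document that only mimics the nested shape used above (data, then paragraphs, then context). The titles and texts are invented for illustration and pandas is left out on purpose.

```python
import json

# A tiny, made-up document in the same nested shape as the dataset above.
raw = json.loads("""
{"data": [
  {"title": "A", "paragraphs": [{"context": "first context", "qas": []},
                                 {"context": "second context", "qas": []}]},
  {"title": "B", "paragraphs": [{"context": "third context", "qas": []}]}
]}
""")

# Flatten article -> paragraph -> context into one list of texts.
contexts = [p["context"] for article in raw["data"] for p in article["paragraphs"]]

# Assign a 1-based document number to every context, as done above.
rows = [{"docno": f"d{i}", "text": text} for i, text in enumerate(contexts, start=1)]

for row in rows:
    print(row)
```

Each row corresponds to one line of the DataFrame built above; article boundaries are deliberately dropped, so every context becomes an independent document.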


## Indexing

The first step before applying any kind of search is to index the whole collection.

In [7]:
indexer = pt.DFIndexer("./index_contexts", overwrite=True)
index_ref = indexer.index(contexts_df["text"], contexts_df["docno"])
index_ref.toString()

'./index_contexts/data.properties'

In [8]:
!ls -lh index_contexts/

total 9.2M
-rw------- 1 root root 1.7M May 27 21:35 data.direct.bf
-rw------- 1 root root 317K May 27 21:35 data.document.fsarrayfile
-rw------- 1 root root 1.4M May 27 21:35 data.inverted.bf
-rw------- 1 root root 4.7M May 27 21:35 data.lexicon.fsomapfile
-rw------- 1 root root 1017 May 27 21:35 data.lexicon.fsomaphash
-rw------- 1 root root 220K May 27 21:35 data.lexicon.fsomapid
-rw------- 1 root root 428K May 27 21:35 data.meta-0.fsomapfile
-rw------- 1 root root 149K May 27 21:35 data.meta.idx
-rw------- 1 root root 399K May 27 21:35 data.meta.zdata
-rw------- 1 root root 4.2K May 27 21:35 data.properties


Let's now print some statistics about our collection (made up of all the contexts in the dataset).

In [9]:
index = pt.IndexFactory.of(index_ref)
print(index.getCollectionStatistics().toString())

Number of documents: 19035
Number of terms: 56194
Number of postings: 991710
Number of fields: 0
Number of tokens: 1284716
Field names: []
Positions:   false



In the following cell we can see how many times the word *nintendo* occurs in each context that contains it. The length of each context is also reported.

In [10]:
pointer = index.getLexicon()["nintendo"]
for posting in index.getInvertedIndex().getPostings(pointer):
    print(f'{posting.toString()} doclen = {posting.getDocumentLength()}')

ID(57) TF(1) doclen = 42
ID(280) TF(2) doclen = 64
ID(282) TF(1) doclen = 54
ID(294) TF(2) doclen = 92
ID(295) TF(2) doclen = 54
ID(296) TF(1) doclen = 133
ID(298) TF(2) doclen = 36
ID(299) TF(2) doclen = 109
ID(300) TF(1) doclen = 111
ID(302) TF(1) doclen = 62
ID(304) TF(1) doclen = 44
ID(309) TF(7) doclen = 103
ID(2760) TF(1) doclen = 46
ID(4625) TF(1) doclen = 69
ID(6036) TF(1) doclen = 85
ID(6047) TF(1) doclen = 56
ID(6413) TF(1) doclen = 111
ID(7340) TF(4) doclen = 106
ID(7341) TF(2) doclen = 48
ID(7343) TF(3) doclen = 45
ID(7344) TF(5) doclen = 118
ID(7345) TF(2) doclen = 47
ID(7346) TF(1) doclen = 92
ID(7347) TF(1) doclen = 67
ID(7348) TF(2) doclen = 57
ID(7349) TF(2) doclen = 50
ID(7350) TF(4) doclen = 121
ID(7351) TF(4) doclen = 50
ID(7352) TF(2) doclen = 63
ID(7353) TF(3) doclen = 61
ID(7354) TF(1) doclen = 116
ID(7355) TF(3) doclen = 70
ID(7357) TF(3) doclen = 48
ID(7358) TF(3) doclen = 93
ID(7359) TF(1) doclen = 94
ID(7360) TF(3) doclen = 100
ID(7361) TF(6) doclen = 84
ID(7
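
The postings listed above come from an inverted index, which maps each term to the documents containing it together with the term frequency. The following is a minimal pure-Python sketch of that idea on a made-up collection. The tokenizer is a naive whitespace split, which is an assumption: Terrier's own pipeline also removes stopwords and stems terms, and its on-disk structures are far more compact.

```python
from collections import Counter, defaultdict

# Toy collection; the texts are made up for illustration.
docs = {
    0: "nintendo released the console",
    1: "the console sold well",
    2: "nintendo games for the nintendo console",
}

def tokenize(text):
    # Naive whitespace tokenizer (assumption, not Terrier's pipeline).
    return text.lower().split()

def build_inverted_index(docs):
    postings = defaultdict(list)  # term -> [(docid, term frequency), ...]
    doc_len = {}                  # docid -> number of tokens
    for docid in sorted(docs):
        tokens = tokenize(docs[docid])
        doc_len[docid] = len(tokens)
        for term, tf in sorted(Counter(tokens).items()):
            postings[term].append((docid, tf))
    return postings, doc_len

postings, doc_len = build_inverted_index(docs)

# Same kind of listing as the cell above: one line per posting.
for docid, tf in postings["nintendo"]:
    print(f"ID({docid}) TF({tf}) doclen = {doc_len[docid]}")
```

Looking up a term then only touches the documents in its posting list, which is why a query does not need to scan the whole collection.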

### TF-IDF Index Search

Let's start by using a natural-language question as the query.

In [13]:
query = "How old is Beyoncé"

br = pt.BatchRetrieve(index, wmodel="TF_IDF")
br.search(query)

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,8117,d8118,0,5.199493,How old is Beyoncé
1,1,5884,d5885,1,5.161569,How old is Beyoncé
2,1,1571,d1572,2,5.141297,How old is Beyoncé
3,1,3552,d3553,3,5.013696,How old is Beyoncé
4,1,1570,d1571,4,5.005569,How old is Beyoncé
...,...,...,...,...,...,...
599,1,9971,d9972,599,1.529740,How old is Beyoncé
600,1,18742,d18743,600,1.514365,How old is Beyoncé
601,1,215,d216,601,1.509308,How old is Beyoncé
602,1,10086,d10087,602,1.470040,How old is Beyoncé


In [16]:
most_relevant_result = contexts_df.loc[contexts_df['docno'] == 'd8118', 'text'].values[0]
print(most_relevant_result)

Old Low Franconian or Old Dutch is regarded as the primary stage in the development of a separate Dutch language. The "Low" in Old Low Franconian refers to the Frankish spoken in the Low Countries where it was not influenced by the High German consonant shift, as opposed to Central and high Franconian in Germany. The latter would as a consequence evolve with Allemanic into Old High German. At more or less the same time the Ingvaeonic nasal spirant law led to the development of Old Saxon, Old Frisian (Anglo-Frisian) and Old English (Anglo-Saxon). Hardly influenced by either development, Old Dutch remained close to the original language of the Franks, the people that would rule Europe for centuries. The language however, did experienced developments on its own, like final-obstruent devoicing in a very early stage. In fact, by judging from the find at Bergakker, it would seem that the language already experienced this characteristic during the Old Frankish period.


As expected, this kind of search method cannot properly answer questions, because it only looks for matching words without considering semantics.
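
A toy example can make this failure mode concrete. The sketch below uses a simplified weighting (raw term frequency times a logarithmic IDF), which is an assumption and not Terrier's TF_IDF model, on three made-up documents. A document that repeats a common query word outranks the only document that is actually about the subject of the question.

```python
import math
from collections import Counter

# Made-up mini collection standing in for the contexts.
docs = {
    "d1": "beyonce was born in houston in 1981",
    "d2": "old dutch and old saxon and old english are old languages",
    "d3": "the old town has a music school",
}

def tokenize(text):
    return text.lower().split()

def tf_idf_scores(query, docs):
    n_docs = len(docs)
    doc_counts = {docno: Counter(tokenize(text)) for docno, text in docs.items()}
    scores = {}
    for docno, counts in doc_counts.items():
        score = 0.0
        for term in sorted(set(tokenize(query))):
            # Document frequency: number of documents containing the term.
            df = sum(1 for c in doc_counts.values() if term in c)
            if df == 0 or term not in counts:
                continue
            score += counts[term] * math.log(1 + n_docs / df)
        scores[docno] = score
    return scores

query = "how old is beyonce"
scores = tf_idf_scores(query, docs)
for docno, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(docno, round(score, 3))
```

Here d2 wins purely because it repeats the word "old", while d1, the only document mentioning the person in the question, comes second.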

Let's try a simpler example.

In [23]:
query = "chopin music"

br.search(query)

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,107,d108,0,11.815478,chopin music
1,1,147,d148,1,11.806783,chopin music
2,1,146,d147,2,11.707881,chopin music
3,1,103,d104,3,11.377966,chopin music
4,1,143,d144,4,10.976775,chopin music
...,...,...,...,...,...,...
876,1,6098,d6099,876,1.514049,chopin music
877,1,12536,d12537,877,1.405344,chopin music
878,1,11602,d11603,878,1.367929,chopin music
879,1,6119,d6120,879,1.290618,chopin music


In [24]:
most_relevant_result = contexts_df.loc[contexts_df['docno'] == 'd108', 'text'].values[0]
print(most_relevant_result)

Chopin's music remains very popular and is regularly performed, recorded and broadcast worldwide. The world's oldest monographic music competition, the International Chopin Piano Competition, founded in 1927, is held every five years in Warsaw. The Fryderyk Chopin Institute of Poland lists on its website over eighty societies world-wide devoted to the composer and his music. The Institute site also lists nearly 1,500 performances of Chopin works on YouTube as of January 2014.


We can observe that using search as intended, with keyword queries, leads to more coherent results.

Now we try several queries at once, each consisting of a single keyword.

In [25]:
queries = pd.DataFrame([["query1", "child"], ["query2", "labour"], ["query3", "USA"]], columns=["qid", "query"])
br(queries)

Unnamed: 0,qid,docid,docno,rank,score,query
0,query1,11240,d11241,0,6.604772,child
1,query1,11228,d11229,1,6.565335,child
2,query1,11229,d11230,2,6.536568,child
3,query1,11243,d11244,3,6.493888,child
4,query1,11239,d11240,4,6.479785,child
...,...,...,...,...,...,...
470,query3,7363,d7364,88,2.961609,USA
471,query3,18345,d18346,89,2.924039,USA
472,query3,657,d658,90,2.364202,USA
473,query3,13714,d13715,91,2.293626,USA


In [26]:
most_relevant_result = contexts_df.loc[contexts_df['docno'] == 'd11241', 'text'].values[0]
print(most_relevant_result)

Accurate present day child labour information is difficult to obtain because of disagreements between data sources as to what constitutes child labour. In some countries, government policy contributes to this difficulty. For example, the overall extent of child labour in China is unclear due to the government categorizing child labour data as “highly secret”. China has enacted regulations to prevent child labour; still, the practice of child labour is reported to be a persistent problem within China, generally in agriculture and low-skill service sectors as well as small workshops and manufacturing enterprises.
In 2014, the U.S. Department of Labor issued a List of Goods Produced by Child Labor or Forced Labor where China was attributed 12 goods the majority of which were produced by both underage children and indentured labourers. The report listed electronics, garments, toys and coal among other goods.


In this case too the result is positive: the top-ranked context for the first query is about child labour.

### BM25 Index Search

Now let's run the same searches as before using BM25 instead of TF-IDF to see whether we obtain different results.

In [27]:
query = "How old is Beyoncé"

br = pt.BatchRetrieve(index, wmodel="BM25")
br.search(query)

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,8117,d8118,0,9.356404,How old is Beyoncé
1,1,5884,d5885,1,9.288160,How old is Beyoncé
2,1,1571,d1572,2,9.251682,How old is Beyoncé
3,1,3552,d3553,3,9.022065,How old is Beyoncé
4,1,1570,d1571,4,9.007442,How old is Beyoncé
...,...,...,...,...,...,...
599,1,9971,d9972,599,2.752742,How old is Beyoncé
600,1,18742,d18743,600,2.725075,How old is Beyoncé
601,1,215,d216,601,2.715976,How old is Beyoncé
602,1,10086,d10087,602,2.645314,How old is Beyoncé


In [28]:
most_relevant_result = contexts_df.loc[contexts_df['docno'] == 'd8118', 'text'].values[0]
print(most_relevant_result)

Old Low Franconian or Old Dutch is regarded as the primary stage in the development of a separate Dutch language. The "Low" in Old Low Franconian refers to the Frankish spoken in the Low Countries where it was not influenced by the High German consonant shift, as opposed to Central and high Franconian in Germany. The latter would as a consequence evolve with Allemanic into Old High German. At more or less the same time the Ingvaeonic nasal spirant law led to the development of Old Saxon, Old Frisian (Anglo-Frisian) and Old English (Anglo-Saxon). Hardly influenced by either development, Old Dutch remained close to the original language of the Franks, the people that would rule Europe for centuries. The language however, did experienced developments on its own, like final-obstruent devoicing in a very early stage. In fact, by judging from the find at Bergakker, it would seem that the language already experienced this characteristic during the Old Frankish period.


In [29]:
query = "chopin music"

br.search(query)

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,107,d108,0,21.409718,chopin music
1,1,147,d148,1,21.400160,chopin music
2,1,146,d147,2,21.232845,chopin music
3,1,103,d104,3,20.642463,chopin music
4,1,143,d144,4,19.891899,chopin music
...,...,...,...,...,...,...
876,1,6098,d6099,876,2.698015,chopin music
877,1,12536,d12537,877,2.504304,chopin music
878,1,11602,d11603,878,2.437631,chopin music
879,1,6119,d6120,879,2.299864,chopin music


In [31]:
most_relevant_result = contexts_df.loc[contexts_df['docno'] == 'd108', 'text'].values[0]
print(most_relevant_result)

Chopin's music remains very popular and is regularly performed, recorded and broadcast worldwide. The world's oldest monographic music competition, the International Chopin Piano Competition, founded in 1927, is held every five years in Warsaw. The Fryderyk Chopin Institute of Poland lists on its website over eighty societies world-wide devoted to the composer and his music. The Institute site also lists nearly 1,500 performances of Chopin works on YouTube as of January 2014.


In [32]:
queries = pd.DataFrame([["query1", "child"], ["query2", "labour"], ["query3", "USA"]], columns=["qid", "query"])
br(queries)

Unnamed: 0,qid,docid,docno,rank,score,query
0,query1,11240,d11241,0,12.048337,child
1,query1,11228,d11229,1,11.976396,child
2,query1,11229,d11230,2,11.923920,child
3,query1,11243,d11244,3,11.846063,child
4,query1,11239,d11240,4,11.820337,child
...,...,...,...,...,...,...
470,query3,7363,d7364,88,5.414216,USA
471,query3,18345,d18346,89,5.345533,USA
472,query3,657,d658,90,4.322076,USA
473,query3,13714,d13715,91,4.193054,USA


In [33]:
most_relevant_result = contexts_df.loc[contexts_df['docno'] == 'd11241', 'text'].values[0]
print(most_relevant_result)

Accurate present day child labour information is difficult to obtain because of disagreements between data sources as to what constitutes child labour. In some countries, government policy contributes to this difficulty. For example, the overall extent of child labour in China is unclear due to the government categorizing child labour data as “highly secret”. China has enacted regulations to prevent child labour; still, the practice of child labour is reported to be a persistent problem within China, generally in agriculture and low-skill service sectors as well as small workshops and manufacturing enterprises.
In 2014, the U.S. Department of Labor issued a List of Goods Produced by Child Labor or Forced Labor where China was attributed 12 goods the majority of which were produced by both underage children and indentured labourers. The report listed electronics, garments, toys and coal among other goods.


Although the scores assigned to the documents differ from the TF-IDF ones, the most relevant document is the same as before for all three of our examples.
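
To give an idea of where a BM25 score comes from, here is a minimal sketch of the textbook Okapi BM25 term weight. It is not meant to reproduce Terrier's numbers: the parameters k1 = 1.2 and b = 0.75, the IDF variant, and the document frequency DF are all assumptions. The collection size and token count are taken from the statistics printed earlier, and the (TF, doclen) pairs from three of the *nintendo* postings.

```python
import math

N_DOCS = 19035               # "Number of documents" from the collection statistics
AVG_LEN = 1284716 / N_DOCS   # "Number of tokens" divided by the number of documents
DF = 50                      # hypothetical document frequency (assumption)

def idf(n_docs, df):
    # One common smoothed IDF variant; other variants exist.
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))

def bm25_weight(tf, doc_len, k1=1.2, b=0.75):
    # Longer-than-average documents get a larger denominator (length normalization).
    length_norm = k1 * (1 - b + b * doc_len / AVG_LEN)
    # The tf part saturates: it can never exceed k1 + 1.
    return idf(N_DOCS, DF) * tf * (k1 + 1) / (tf + length_norm)

# (term frequency, document length) pairs taken from the postings listed earlier.
postings = [(7, 103), (2, 36), (1, 133)]
weights = [bm25_weight(tf, doc_len) for tf, doc_len in postings]

for (tf, doc_len), weight in zip(postings, weights):
    print(f"TF({tf}) doclen = {doc_len} -> weight {weight:.3f}")
```

The weight still grows with the term frequency, but much more slowly than the raw count, and long documents are penalized. For single-keyword queries like the ones above, this kind of reweighting changes the score scale more readily than it changes which document comes first.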